This is the third post in a series of three. I’ve shown how I randomise AMC exams and how I randomise C code to ask questions about it, and now I present how I automatically assess students’ programs.
DISCLAIMER: don’t expect a magic solution to this long-standing problem. What I present works for me because I have a specific context: I need to assess a lot of short programs for an algorithms lecture, and I can anticipate quite well what has to be produced.
What I need to check, and how it’s solved:
- Do programs declare the expected functions? This is done by parsing the code and searching through it.
- Do required functions exhibit the right structures? For instance, a conditional nested inside a loop for a search function. This is done by performing xpath queries on the code.
- Do programs compile correctly? This is easy to automate.
- Do programs behave correctly on chosen examples? This is done by replacing the main function with functional tests.
- Are some programs too similar to others? This is done by computing compression distances between each pair of programs.
The first two points need to be checked even on programs that do not compile.
Fortunately, the clang parser is quite robust with respect to poorly written code.
Moreover, running students’ programs may be dangerous for my computer, so to do it securely I use firejail to lock programs into a secure sandbox.
All these needs have been packed into the tool badass, which I’ve designed exactly for this purpose.
It’s in a very early beta state, but I’m using it for actual code assessment and it does what I need.
In the rest of this post, I’ll illustrate these various aspects with concrete examples (not necessarily in the order listed above).
Generating the exam
I’m teaching algorithmics in C to first-year students. I’ll illustrate this post with a very simple yet real exercise: change the case of a string, in several variants (to lower/to upper/swap case, on all letters/consonants/vowels).
The question asks students to complete a given program that is generated using badass.p5.
The purpose of this program is to provide a bootstrap as well as a test case for the students.
#include <stdio.h>
//let head = randstring(adjectives)
//let tail = randstring(animals)
//import random, qnd
//if random.randint(0,1)
//let STR = String(head.upper() + " " + tail.lower())
//else
//let STR = String(head.lower() + " " + tail.upper())
//end if
//let algo = choice("strlow", "strup", "strswap", "strclow", "strcup", "strcswap", "strvlow", "strvup", "strvswap")
//let RESULT = String(getattr(qnd, algo)(STR))
//let FUN = randname()
//let ARG = randname()
void FUN (char* ARG) {
// write this function body
}
//let TEST = randname()
int main (void) {
char TEST[] = STR;
FUN(TEST);
printf("%s\n", TEST); // prints RESULT
}
Note that class badass.p5.String is like a Python str except that its __str__ method returns a literal string as required by C.
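I haven’t shown the code of badass.p5.String, but the idea can be sketched with a tiny hypothetical str subclass (called CString here to make clear it’s not the real class, and escaping only backslashes and quotes):
class CString(str):
    # behave like str, but render as a C string literal
    def __str__(self):
        escaped = self.replace("\\", "\\\\").replace('"', '\\"')
        return f'"{escaped}"'

s = CString("anxious DRAGONFLY")
print(s)           # "anxious DRAGONFLY"   (quoted, ready to paste into C code)
print(s.lower())   # plain str methods still work (and return plain str)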
In this repository, I’ve included:
- the files generated with badass.p5.PrePreProcessor.save (see previous post) in directory todo, one C file for each exam
- the real students’ answers from the exam taken this year in directory done, one directory with one main.c file for each
Detect similar projects
badass includes simple plagiarism detection through the computation of compression distances (in French).
This is not intended to replace real plagiarism-detection tools like MOSS or Baldr (in French), but it gives a quick overview of which projects are suspiciously close to one another, so you can quickly spot where manual inspection is needed.
So let’s run:
$ badass compare --csv=compare.csv done/*
$ badass compare --load=compare.csv --heatmap=compare.png --maxsize=50
The first command computes a distance matrix and saves it to CSV file compare.csv.
This file is reloaded by the second command.
Having this intermediary CSV file saves computation time for future commands, since computing compression distances may take a while (more than 12 minutes for this instance on my laptop, which is quite fast).
Option --heatmap generates the picture below, showing how all projects compare to each other, and option --maxsize splits this picture into more readable ones using the dendrogram computed for the whole heatmap.
The split heatmaps are in the repository. Let’s have a look at this one:
We can clearly see two clusters, each of which contains 7 projects that are very similar to one another.
If we compare projects 154 and 8, for example, we can see that this is indeed almost the same code.
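To give an intuition of what is being computed, here is a minimal sketch of a normalised compression distance using zlib; this is only an illustration of the idea, not necessarily the compressor or exact formula that badass uses:
import zlib

def ncd(a, b):
    # normalised compression distance: close to 0 for near-identical inputs,
    # close to 1 for unrelated ones
    ca, cb = len(zlib.compress(a)), len(zlib.compress(b))
    cab = len(zlib.compress(a + b))
    return (cab - min(ca, cb)) / max(ca, cb)

# compare two of the suspiciously close projects spotted on the heatmap
x = open("done/154/main.c", "rb").read()
y = open("done/8/main.c", "rb").read()
print(ncd(x, y))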
That’s all for plagiarism detection; let’s go back to automated assessment.
Search for the expected declarations
With badass has, we can check whether a file declares an expected function.
For instance, exam #1 should declare void met(char* rouse) to be completed by the student, so we could run:
$ badass --base=done/1 has "void met (char* rouse)@main.c"
load main.c
found void met(char *)@done/1/main.c:3
1
This shows that the declaration has been found as expected.
Note that parameter names are ignored (it’s OK to change them) and, in this code, the student has actually forgotten to include one.
Option --base=done/1 sets the root directory of the project, from which main.c is loaded.
For exam #24, we should have void skyed(char* aside) declared. But…
$ badass --base=done/24 has "void skyed (char* aside)@main.c"
load main.c
not found void skyed (char* aside)@main.c
0
Actually, the student declared void* skyed (char* aside) instead, with a wrong return type.
So we may run another command to check whether a function skyed has been declared at all:
$ badass --base=done/24 has "skyed@main.c"
load main.c
found void * skyed(char *)@done/24/main.c:6
1
So far, we see that we can search for specific functions by name, and optionally by signature. But two problems remain to be solved: how to automate this, and how to exploit this information later on.
For automation, we’ll re-read the files todo/NUM.c using badass.p5.PrePreProcessor, save the parameters of each in done/NUM/bad.sh, and write a script bad.sh at the root of our working directory, to be called with one done/NUM directory as its argument, that will perform the tests as above.
Here is the script that generates the scripts done/NUM/bad.sh:
import pathlib
import tqdm
from badass.p5 import PrePreProcessor

for tgtdir in tqdm.tqdm(list(pathlib.Path("done").glob("*"))) :
    project = tgtdir.name
    ppp = PrePreProcessor(f"todo/{project}.c")
    with open(tgtdir / "bad.sh", "w") as out :
        out.write(f"FUN='{ppp.top.cppenv['FUN']}'\n"
                  f"ARG='{ppp.top.cppenv['ARG']}'\n")
And here is the main bad.sh script:
BASE=$1
. "$BASE/bad.sh"
PROJ=$(basename "$BASE")
badass --record="$PROJ:good_decl" --base="$BASE" has "void $FUN (char* $ARG)@main.c"
badass --record="$PROJ:bad_decl" --base="$BASE" has "$FUN@main.c"
Note option --record added to the badass invocations.
This is the solution to the second problem: this option tells badass to record the result of the query in a database.
It’ll be used later on to build a report of all the tests on all the projects.
The argument of --record is a pair of names separated by a colon; you may think of them as ROW:COLUMN in the future report.
badass uses an SQLite database with a simple structure, mixed with flat files whose paths are referenced in the database.
Check the structure of the function
To go further, we can check whether the function has a conditional nested inside a loop, which is required to perform the expected algorithm.
(Yes, students could have used recursion, and we should test for it, but believe me: it’s useless… And yes, they could have used gotos, but it’s just as well if we don’t find those.)
To do so, badass has a command xpath that performs an xpath query on the source code after it is converted to XML.
There is no fancy conversion or analysis: it just takes the AST from clang and dumps it to XML, with as many attributes as it can interpret.
If you want to see what this XML looks like, use badass cparse -i yourfile.c or badass cparse "some C code that yields a tree"; this shows how you should construct xpath queries.
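If you are curious how such an XML view can be obtained, here is a rough sketch of the idea using the clang Python bindings and lxml; it is not badass’s actual converter (tag and attribute names may not match exactly what badass cparse produces, although I mimic those used in the queries below):
import sys
from clang.cindex import Index
from lxml import etree

def to_xml(cursor):
    # one XML element per AST node, tagged with the lower-cased cursor kind
    elt = etree.Element(cursor.kind.name.lower(), spelling=cursor.spelling or "")
    for child in cursor.get_children():
        elt.append(to_xml(child))
    return elt

tu = Index.create().parse(sys.argv[1])   # e.g. done/1/main.c
xml = to_xml(tu.cursor)
# is there an if nested inside a while, within function 'met'?
print(len(xml.xpath("//function_decl[@spelling='met']//while_stmt//if_stmt")))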
So we search for a loop (for, while, or do) with a conditional (if or switch) inside, and we extend bad.sh accordingly:
badass --record "$PROJ:for_if" --base="$BASE" xpath -c main.c \
"//function_decl[@spelling='$FUN']//for_stmt//if_stmt"
badass --record "$PROJ:while_if" --base="$BASE" xpath -c main.c \
"//function_decl[@spelling='$FUN']//while_stmt//if_stmt"
badass --record "$PROJ:do_if" --base="$BASE" xpath -c main.c \
"//function_decl[@spelling='$FUN']//do_stmt//if_stmt"
badass --record "$PROJ:for_switch" --base="$BASE" xpath -c main.c \
"//function_decl[@spelling='$FUN']//for_stmt//switch_stmt"
badass --record "$PROJ:while_switch" --base="$BASE" xpath -c main.c \
"//function_decl[@spelling='$FUN']//while_stmt//switch_stmt"
badass --record "$PROJ:do_switch" --base="$BASE" xpath -c main.c \
"//function_decl[@spelling='$FUN']//do_stmt//switch_stmt"
(I could possibly have used | to group these xpath queries, but I’m not used to it, so I feel it’s more readable like this.)
Check compilation and perform functional tests
Last but not least, does the program compile, and does it compute the right results?
Here we use the badass run command, which works as follows:
- it collects various files, including two scripts prepare.sh and build.sh that I’ll describe below
- it starts a firejail sandbox with an overlay filesystem, so every change to files and directories made from this point will be cancelled on exit; moreover, every process run from this point will be sandboxed and won’t be able to harm your system (a simplified sketch of such a sandboxed run is given after this list)
- it copies the collected files together with the student’s files
- it runs prepare.sh, which should create the source code to be assessed (more details below)
- it runs build.sh, which should compile the prepared source code (more details below)
- it runs the compiled program inside strace to record every system call made, tracing also the children processes if any (we won’t use these traces here)
- it collects this trace as well as the results of the assessments made using badass.h (more details below)
- for every process run, its return code, standard output, and standard error are also recorded
- firejail exits and everything is restored as it was at the beginning, except of course for the collected information, which may be recorded to the database if --record has been used
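The heart of such a run can be pictured very roughly with the sketch below. It is only an illustration of the firejail + strace combination: the flags are plain firejail and strace options, not badass’s actual invocation, which collects and restores many more things:
import subprocess

def sandboxed_run(workdir):
    # throw-away overlay filesystem: every change made inside is discarded on
    # exit (badass collects trace.log, run.ass, etc. before the sandbox exits)
    cmd = ["firejail", "--quiet", "--overlay-tmpfs",
           # trace every system call of the program and of its children
           "strace", "-f", "-o", "trace.log",
           "./a.out"]
    proc = subprocess.run(cmd, cwd=workdir, capture_output=True, text=True)
    # what is recorded for each process: return code, stdout, stderr
    return proc.returncode, proc.stdout, proc.stderr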
badass.h
In order to test the function’s result, we prepare a file test.c that performs the functional test and is parameterised just like the original source code:
#include "badass.h"
#include <string.h>
int main (void) {
char TEST[] = STR;
assopen("run.ass");
FUN(TEST);
int _cmp = strcmp(TEST, RESULT);
assess(test, _cmp == 0);
}
badass.h declares a macro assess that is similar to assert but does not stop the program when it fails.
Instead, it logs its results in the file that has been given to assopen.
Its first argument is a key used as an identifier of each assessment, to distinguish them within one run.
Note that we save the result of strcmp in a variable because assess is a macro that converts its arguments to strings (using the #arg feature of CPP), which may fail when the expression itself includes a string.
Note also that we have used a literal string and not a variable, which is actually very important: my first tests were using something like
char EXPECT[] = RESULT;
char TEST[] = STR;
to declare the expected result and the string to test.
But I discovered that some badly written programs overflowed string TEST and changed EXPECT as well, in such a way that the comparison of both was always correct even if the algorithm was completely wrong.
So, it is a good idea to use strings that do not reside in the same areas of the process memory.
In general, it turns out to be a surprisingly difficult task to reliably perform functional tests on C programs that may have memory corruption.
prepare.sh
This script is responsible for creating a source file to be compiled and tested.
In general, it replaces the main function provided by students with one that performs the expected tests.
For the example exam, I’ve used two tests: one to check whether the student’s program provides the right result on the example from the exam statement, and another that checks an unexpected string to test the program more thoroughly.
First, we extend mkbad.py so that it saves more information and creates files run.sh together with bad.sh:
import pathlib, string, random
import tqdm
import qnd
from badass.p5 import PrePreProcessor, String

random.seed(12345)
chars = list(string.ascii_letters + string.digits + "!#$%&()*+,-./:;<=>?@[]^_{|}~")
random.shuffle(chars)
STR = String("".join(chars))

for tgtdir in tqdm.tqdm(list(pathlib.Path("done").glob("*"))) :
    project = tgtdir.name
    ppp = PrePreProcessor(f"todo/{project}.c")
    with open(tgtdir / "bad.sh", "w") as out :
        out.write(f"FUN='{ppp.top.cppenv['FUN']}'\n"
                  f"ARG='{ppp.top.cppenv['ARG']}'\n"
                  f"TEST='{ppp.top.cppenv['TEST']}'\n"
                  f"STR='{ppp.top.cppenv['STR']}'\n"
                  f"RESULT='{ppp.top.cppenv['RESULT']}'\n")
    RESULT = String(getattr(qnd, ppp.top.cppenv["algo"])(STR))
    with open(tgtdir / "run.sh", "w") as out :
        out.write(f"STR='{STR}'\n"
                  f"RESULT='{RESULT}'\n")
Intuitively, bad.sh has everything we need to build a valid test.c as in the exam statement, and run.sh can replace two variables to perform the unexpected test.
If we want to run the test from the original exam statement, we’ll replace run.sh with an empty file; otherwise, we leave it as generated.
Script prepare.sh will thus read the student’s source, drop function main, and replace it with the one from test.c.
. bad.sh
. run.sh
badass -o run.c patch main.c del main
badass p5 -l TEST="$TEST" -l STR="$STR" -l FUN="$FUN" -l RESULT="$RESULT" test.c >> run.c
cat run.c
First we load bad.sh and run.sh (the latter possibly replaced with an empty file).
Then, we use badass patch to remove function main from file main.c, and then badass p5 to add our own function main.
Everything is saved to a file run.c that we cat at the end, so we’ll be able to inspect the source code actually used in the test.
Don’t forget that firejail will cancel all these changes to the filesystem, but badass saves the processes’ outputs.
Talking about firejail, you may have an issue executing badass from within it: it may not find libclang (and will complain about ldconfig not being accessible).
To work around this, configure BADASS_LIBCLANG as explained in the installation instructions.
(This shows that firejail indeed restricts access to the system…)
build.sh
Script build.sh is very simple:
gcc run.c
It produces a file a.out that badass run expects in order to execute the program.
Note that it could also produce a file argv with one argument per line to be passed to a.out upon execution.
In general, build.sh could be produced or customised by the students in order to include the libraries they may have needed.
Perform the test
In order to perform the test on known values, we have to overwrite run.sh with an empty file, so we prepare an empty run.sh at the root of our working directory.
Then we can run:
$ badass --base=done/1 run --out --err --ret -r badass.h prepare.sh build.sh test.c run.sh
The options are as follows:
- --base sets the base directory for the assessed project, as previously
- --out tells badass run to print the standard output of the three processes used: the execution of prepare.sh, that of build.sh, and that of the program
- similarly, --err and --ret print the standard error and the return code of these processes
- -r is followed by a list of files to add to the base directory (replacing them if they exist), so in particular this will overwrite done/1/run.sh with its empty version. (We could have used option -a, which adds files but avoids overwriting existing ones; typically build.sh is added with -a when students may customise it.)
Note that you don’t need to have badass.h available among your files: if it is, it will be used; otherwise, the stock version will be used transparently.
Then, badass run starts firejail, copies the files, and runs the three processes, which results in this output:
run done/1
=== prep.ret ===
0
=== prep.out ===
#include <stdio.h>
void met (char* s) {
int i = 0;
while (s[i]) {
if (s[i] >= 'a' && s[i] <= 'z') {
s[i] += 'A' - 'a';
}
i++;
}
}
#include "badass.h"
#include <string.h>
int main (void) {
char these[] = "anxious DRAGONFLY";
assopen("run.ass");
met(these);
int _cmp = strcmp(these, "ANXIOUS DRAGONFLY");
assess(known, _cmp == 0);
}
=== prep.err ===
load main.c
--- main
=== build.ret ===
0
=== run.ret ===
0
Note that build.err, build.out, and others are not printed because they are empty.
Note also the blank lines in run.c that correspond to where the original function main was: when it is deleted, its source is replaced with blank lines so that compilation error messages will be consistent with the original source code.
(Another reason is that this allows deleting or adding several declarations without reparsing the source code after each change.)
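The trick is easy to reproduce on its own: to drop a declaration that spans given lines without shifting the rest of the file, just overwrite those lines with blanks. A sketch of the idea (the 5–12 range is hypothetical; badass finds the actual extent from the clang AST):
def blank_out(lines, start, stop):
    # replace lines start..stop (1-based, inclusive) with empty lines so that
    # later diagnostics still point to the original line numbers
    for i in range(start - 1, stop):
        lines[i] = ""
    return lines

src = open("main.c").read().splitlines()
src = blank_out(src, 5, 12)               # hypothetical extent of main()
open("run.c", "w").write("\n".join(src) + "\n")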
If we run the same command but without including run.sh in the -r option, we’ll have the test with the more complete string:
$ badass --base=done/1 run --out --err --ret -r badass.h prepare.sh build.sh test.c
run done/1
=== prep.ret ===
0
=== prep.out ===
#include <stdio.h>
void met (char* s) {
int i = 0;
while (s[i]) {
if (s[i] >= 'a' && s[i] <= 'z') {
s[i] += 'A' - 'a';
}
i++;
}
}
#include "badass.h"
#include <string.h>
int main (void) {
char these[] = "8:tC=N>WEOm6Xd^Y)B4eqn]ifJr|z.RFh!c5<o#UK2*Pg&;LkQZa%?SG79D}@/~_jAsv$0(lTx+w,H[p{u3-IyVMb1";
assopen("run.ass");
met(these);
int _cmp = strcmp(these, "8:TC=N>WEOM6XD^Y)B4EQN]IFJR|Z.RFH!C5<O#UK2*PG&;LKQZA%?SG79D}@/~_JASV$0(LTX+W,H[P{U3-IYVMB1");
assess(ascii, _cmp == 0);
}
=== prep.err ===
load main.c
--- main
=== build.ret ===
0
=== run.ret ===
0
In order to automate and record these two tests, we just need to add these two commands at the end of our main bad.sh
(I’ve removed --out
and such options here):
badass --record="$PROJ:known" --base="$BASE" run -r badass.h prepare.sh build.sh test.c run.c
badass --record="$PROJ:ascii" --base="$BASE" run -r badass.h prepare.sh build.sh test.c
Performing all the tests now just consists in running bad.sh in a loop.
Personally, I like to add it at the end of mkbad.py:
import pathlib, string, random, subprocess
import qnd
from badass.p5 import PrePreProcessor, String

random.seed(12345)
chars = list(string.ascii_letters + string.digits + "!#$%&()*+,-./:;<=>?@[]^_{|}~")
random.shuffle(chars)
STR = String("".join(chars))

for tgtdir in sorted(pathlib.Path("done").glob("*")) :
    project = tgtdir.name
    ppp = PrePreProcessor(f"todo/{project}.c")
    with open(tgtdir / "bad.sh", "w") as out :
        out.write(f"FUN='{ppp.top.cppenv['FUN']}'\n"
                  f"ARG='{ppp.top.cppenv['ARG']}'\n"
                  f"TEST='{ppp.top.cppenv['TEST']}'\n"
                  f"STR='{ppp.top.cppenv['STR']}'\n"
                  f"RESULT='{ppp.top.cppenv['RESULT']}'\n")
    RESULT = String(getattr(qnd, ppp.top.cppenv["algo"])(STR))
    with open(tgtdir / "run.sh", "w") as out :
        out.write(f"STR='{STR}'\n"
                  f"RESULT='{RESULT}'\n")
    subprocess.check_call(["sh", "bad.sh", str(tgtdir)])
In this case, I suppress the tqdm progress bar and sort the iteration, so that the output from badass run gives enough information to follow the progress.
One thing could go wrong at this point: a program may enter a deadlock or a livelock.
There are two cases: either it’s a livelock that consumes CPU, and then firejail will kill the process after 60 seconds of CPU time; or the process does not consume CPU (for instance it sleeps or waits for input), in which case you’ll have to kill it by hand.
Future versions may include a watchdog on wall-clock time, but that’s not the case yet.
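In the meantime, if you drive the runs from your own script, a crude wall-clock watchdog is easy to add around each invocation; this is a sketch of mine, not a badass feature, and the 120-second limit is arbitrary:
import subprocess

def assess_with_timeout(tgtdir, timeout=120):
    # kill bad.sh after `timeout` seconds of wall-clock time; a process started
    # deeper in the tree (e.g. inside firejail) may still survive and need a
    # manual kill
    try:
        subprocess.run(["sh", "bad.sh", str(tgtdir)], timeout=timeout)
    except subprocess.TimeoutExpired:
        print(f"{tgtdir}: timed out after {timeout}s, needs manual inspection")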
The whole run will take a while because badass is quite slow at startup and we invoke it many times, but well, this is time you don’t have to spend yourself assessing the projects.
Collect results and prepare final assessment
To produce the final assessment, we start by running badass report, which produces a file badass.csv with all our tests in the columns and all the projects in the rows:
With this file open in a spreadsheet, you can add columns and compute a mark depending on your criteria. Mine are usually: give the maximal mark if all dynamic tests (compilation and functional) pass; otherwise, take the sum of the results from the dynamic tests plus 1, and multiply it by a mark computed from the static tests, in such a way that the result stays below a maximum. This gives a boost to programs that compile and pass some dynamic tests, programs that just compile are not marked too badly, and programs that do not even compile get a bad mark. However, I usually check manually the programs that compile but pass only the known functional test; for instance, this may come from functions that fill their argument with the static string expected by the exam statement.
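As an illustration, here is one possible reading of that rule as a script rather than spreadsheet formulas. The column names are the ones recorded with --record above (I ignore a separate compilation column for simplicity), and the project column name, the weights, and the 0–5 scale are only assumptions:
import csv

STATIC = ["good_decl", "bad_decl", "for_if", "while_if", "do_if",
          "for_switch", "while_switch", "do_switch"]
DYNAMIC = ["known", "ascii"]   # the two functional tests recorded earlier
MAX_MARK = 5                   # arbitrary scale for this example

with open("badass.csv") as f:
    for row in csv.DictReader(f):
        dyn = sum(int(row[c] or 0) for c in DYNAMIC)
        sta = sum(int(row[c] or 0) for c in STATIC) / len(STATIC)
        if dyn == len(DYNAMIC):
            mark = MAX_MARK                            # all dynamic tests pass
        else:
            mark = min(MAX_MARK - 1, (dyn + 1) * sta)  # partial credit, capped
        print(row["project"], round(mark, 1))          # "project" column name assumed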
Whatever you do with these marks, they let you see which programs actually compile, pass the functional tests, or at least exhibit some of the required elements.
And these are the programs you may want to assess manually; the others are probably not worth looking at.
For instance, many colleagues of mine won’t even try to assess programs that do not compile.
badass allows you to be less rigid, as it can search for the structures expected from the algorithm the students were asked to implement.