This is the third post in a series of three. I’ve shown how I randomise AMC exams and how I randomise C code to ask questions about it, and now I present how I automatically assess students’ programs.
DISCLAIMER: don’t expect a magic solution to this long-standing problem. What I present works for me because I have a specific context: I need to assess a lot of short programs for an algorithms lecture, and I can anticipate quite well what has to be produced.
What I need to check, and how it’s solved:
- Do programs declare the expected functions? This is done by parsing the code and searching through it.
- Do required functions exhibit the right structures? For instance, a conditional nested inside a loop for a search function. This is done by performing xpath queries on the code.
- Do programs compile correctly? This is easy to automate.
- Do programs behave correctly on chosen examples? This is done by replacing the main function with functional tests.
- Are some programs too similar to others? This is done by computing compression distances between each pair of programs.
The first two points need to be checked even on programs that do not compile.
Fortunately, the clang parser is quite robust with respect to poorly written code.
Moreover, running students’ programs may be dangerous for my computer, so to do it securely I use firejail to lock programs into a secure sandbox.
All these needs have been packed into the tool badass, which I’ve designed exactly for this purpose.
It’s in a very early beta state, but I’m using it for actual code assessment and it does what I need.
In the rest of this post, I’ll illustrate these various aspects with concrete examples (not necessarily in the order listed above).
Generating the exam
I’m teaching algorithmics in C to first-year students. I’ll illustrate this post with a very simple yet real exercise: change the case of a string, in several variants (to lower/to upper/swap case, on all letters/consonants/vowels).
The question asks students to complete a given program that is generated using badass.p5.
The purpose of this program is to provide a bootstrap as well as a test case for the students.
#include <stdio.h>
//let head = randstring(adjectives)
//let tail = randstring(animals)
//import random, qnd
//if random.randint(0,1)
//let STR = String(head.upper() + " " + tail.lower())
//else
//let STR = String(head.lower() + " " + tail.upper())
//end if
//let algo = choice("strlow", "strup", "strswap", "strclow", "strcup", "strcswap", "strvlow", "strvup", "strvswap")
//let RESULT = String(getattr(qnd, algo)(STR))
//let FUN = randname()
//let ARG = randname()
void FUN (char* ARG) {
// write this function body
}
//let TEST = randname()
int main (void) {
char TEST[] = STR;
FUN(TEST);
printf("%s\n", TEST); // prints RESULT
}
Note that class badass.p5.String is like a Python str except that its __str__ method returns a literal string as required by C.
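I haven’t shown the code of badass.p5.String, but the idea can be sketched with a tiny hypothetical str subclass (called CString here to make clear it’s not the real class, and escaping only backslashes and quotes):
class CString(str):
    # behave like str, but render as a C string literal
    def __str__(self):
        escaped = self.replace("\\", "\\\\").replace('"', '\\"')
        return f'"{escaped}"'

s = CString("anxious DRAGONFLY")
print(s)           # "anxious DRAGONFLY"   (quoted, ready to paste into C code)
print(s.lower())   # plain str methods still work (and return plain str)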
In this repository, I’ve included:
- the files generated with badass.p5.PrePreProcessor.save (see previous post) in directory todo, one C file for each exam
- the real students’ answers from the exam taken this year in directory done, one directory with one main.c file for each
Detect similar projects
badass includes simple plagiarism detection through the computation of compression distances (in French).
This is not intended to replace real plagiarism-detection tools like MOSS or Baldr (in French), but it gives a quick overview of which projects are suspiciously close to one another, so you can quickly spot where manual inspection is needed.
So let’s run:
$ badass compare --csv=compare.csv done/*
$ badass compare --load=compare.csv --heatmap=compare.png --maxsize=50
The first command computes a distance matrix and saves it to CSV file compare.csv.
This file is reloaded by the second command.
Having this intermediary CSV file saves computation time for future commands, since computing compression distances may take a while (more than 12 minutes for this instance on my laptop, which is quite fast).
Option --heatmap generates the picture below, showing how all projects compare to each other, and option --maxsize splits this picture into more readable ones using the dendrogram computed for the whole heatmap.
The split heatmaps are in the repository. Let’s have a look at this one:
We can clearly see two clusters, each of which contains 7 projects that are very similar to one another.
If we compare projects 154 and 8, for example, we can see that this is indeed almost the same code.
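To give an intuition of what is being computed, here is a minimal sketch of a normalised compression distance using zlib; this is only an illustration of the idea, not necessarily the compressor or exact formula that badass uses:
import zlib

def ncd(a, b):
    # normalised compression distance: close to 0 for near-identical inputs,
    # close to 1 for unrelated ones
    ca, cb = len(zlib.compress(a)), len(zlib.compress(b))
    cab = len(zlib.compress(a + b))
    return (cab - min(ca, cb)) / max(ca, cb)

# compare two of the suspiciously close projects spotted on the heatmap
x = open("done/154/main.c", "rb").read()
y = open("done/8/main.c", "rb").read()
print(ncd(x, y))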
That’s all for plagiarism detection; let’s go back to automated assessment.
Search for the expected declarations
With badass has, we can check whether a file declares an expected function.
For instance, exam #1 should declare void met(char* rouse) to be completed by the student, so we could run:
$ badass --base=done/1 has "void met (char* rouse)@main.c"
load main.c
found void met(char *)@done/1/main.c:3
1
This shows that the declaration has been found as expected.
Note that parameter names are ignored (it’s OK to change them) and, in this code, the student has actually forgotten to include one.
Option --base=done/1 sets the root directory of the project, from which main.c is loaded.
For exam #24, we should have void skyed(char* aside) declared. But…
$ badass --base=done/24 has "void skyed (char* aside)@main.c"
load main.c
not found void skyed (char* aside)@main.c
0
Actually, the student declared void* skyed (char* aside) instead, with a wrong return type.
So we may run another command to check whether a function skyed has been declared at all:
$ badass --base=done/24 has "skyed@main.c"
load main.c
found void * skyed(char *)@done/24/main.c:6
1
So far, we see that we can search for specific functions by name, and optionally by signature. But two problems remain to be solved: how to automate this, and how to exploit this information later on.
For automation, we’ll re-read the files todo/NUM.c using badass.p5.PrePreProcessor, save the parameters of each in done/NUM/bad.sh, and write a script bad.sh at the root of our working directory, to be called with one done/NUM directory as its argument, that will perform the tests as above.
Here is the script that generates the scripts done/NUM/bad.sh:
import pathlib
import tqdm
from badass.p5 import PrePreProcessor

for tgtdir in tqdm.tqdm(list(pathlib.Path("done").glob("*"))) :
    project = tgtdir.name
    ppp = PrePreProcessor(f"todo/{project}.c")
    with open(tgtdir / "bad.sh", "w") as out :
        out.write(f"FUN='{ppp.top.cppenv['FUN']}'\n"
                  f"ARG='{ppp.top.cppenv['ARG']}'\n")
And here is the main bad.sh script:
BASE=$1
. "$BASE/bad.sh"
PROJ=$(basename "$BASE")
badass --record="$PROJ:good_decl" --base="$BASE" has "void $FUN (char* $ARG)@main.c"
badass --record="$PROJ:bad_decl" --base="$BASE" has "$FUN@main.c"
Note option --record added to the badass invocations.
This is the solution to the second problem: this option tells badass to record the result of the query in a database.
It’ll be used later on to build a report of all the tests on all the projects.
The argument of --record is a pair of names separated by a colon; you may think of them as ROW:COLUMN in the future report.
badass uses an SQLite database with a simple structure, mixed with flat files whose paths are referenced in the database.
Check the structure of the function
To go further, we can check whether the function has a conditional nested inside a loop, which is required to perform the expected algorithm.
(Yes, students could have used recursion, and we should test for it, but believe me: it’s useless… And yes, they could have used gotos, but it’s just as well if we don’t find those.)
To do so, badass has a command xpath that performs an xpath query on the source code after it is converted to XML.
There is no fancy conversion or analysis: it just takes the AST from clang and dumps it to XML, with as many attributes as it can interpret.
If you want to see what this XML looks like, use badass cparse -i yourfile.c or badass cparse "some C code that yields a tree"; this shows how you should construct xpath queries.
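If you are curious how such an XML view can be obtained, here is a rough sketch of the idea using the clang Python bindings and lxml; it is not badass’s actual converter (tag and attribute names may not match exactly what badass cparse produces, although I mimic those used in the queries below):
import sys
from clang.cindex import Index
from lxml import etree

def to_xml(cursor):
    # one XML element per AST node, tagged with the lower-cased cursor kind
    elt = etree.Element(cursor.kind.name.lower(), spelling=cursor.spelling or "")
    for child in cursor.get_children():
        elt.append(to_xml(child))
    return elt

tu = Index.create().parse(sys.argv[1])   # e.g. done/1/main.c
xml = to_xml(tu.cursor)
# is there an if nested inside a while, within function 'met'?
print(len(xml.xpath("//function_decl[@spelling='met']//while_stmt//if_stmt")))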
So we search for a loop (for, while, or do) with a conditional (if or switch) inside, and we extend bad.sh accordingly:
badass --record "$PROJ:for_if" --base="$BASE" xpath -c main.c \
"//function_decl[@spelling='$FUN']//for_stmt//if_stmt"
badass --record "$PROJ:while_if" --base="$BASE" xpath -c main.c \
"//function_decl[@spelling='$FUN']//while_stmt//if_stmt"
badass --record "$PROJ:do_if" --base="$BASE" xpath -c main.c \
"//function_decl[@spelling='$FUN']//do_stmt//if_stmt"
badass --record "$PROJ:for_switch" --base="$BASE" xpath -c main.c \
"//function_decl[@spelling='$FUN']//for_stmt//switch_stmt"
badass --record "$PROJ:while_switch" --base="$BASE" xpath -c main.c \
"//function_decl[@spelling='$FUN']//while_stmt//switch_stmt"
badass --record "$PROJ:do_switch" --base="$BASE" xpath -c main.c \
"//function_decl[@spelling='$FUN']//do_stmt//switch_stmt"
(I could possibly have used | to group these xpath queries, but I’m not used to it, so I feel it’s more readable like this.)
Check compilation and perform functional tests
Last but not least, does the program compile, and does it compute the right results?
Here we use the badass run command, which works as follows:
- it collects various files, including two scripts prepare.sh and build.sh that I’ll describe below
- it starts a firejail sandbox with an overlay filesystem, so every change to files and directories made from this point will be cancelled on exit; moreover, every process run from this point will be sandboxed and won’t be able to harm your system (a simplified sketch of such a sandboxed run is given after this list)
- it copies the collected files together with the student’s files
- it runs prepare.sh, which should create the source code to be assessed (more details below)
- it runs build.sh, which should compile the prepared source code (more details below)
- it runs the compiled program inside strace to record every system call made, tracing also the children processes if any (we won’t use these traces here)
- it collects this trace as well as the results of the assessments made using badass.h (more details below)
- for every process run, its return code, standard output, and standard error are also recorded
- firejail exits and everything is restored as it was at the beginning, except of course for the collected information, which may be recorded to the database if --record has been used
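The heart of such a run can be pictured very roughly with the sketch below. It is only an illustration of the firejail + strace combination: the flags are plain firejail and strace options, not badass’s actual invocation, which collects and restores many more things:
import subprocess

def sandboxed_run(workdir):
    # throw-away overlay filesystem: every change made inside is discarded on
    # exit (badass collects trace.log, run.ass, etc. before the sandbox exits)
    cmd = ["firejail", "--quiet", "--overlay-tmpfs",
           # trace every system call of the program and of its children
           "strace", "-f", "-o", "trace.log",
           "./a.out"]
    proc = subprocess.run(cmd, cwd=workdir, capture_output=True, text=True)
    # what is recorded for each process: return code, stdout, stderr
    return proc.returncode, proc.stdout, proc.stderr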
badass.h
In order to test the function’s result, we prepare a file test.c that performs the functional test and is parameterised just like the original source code:
#include "badass.h"
#include <string.h>
int main (void) {
char TEST[] = STR;
assopen("run.ass");
FUN(TEST);
int _cmp = strcmp(TEST, RESULT);
assess(test, _cmp == 0);
}
badass.h declares a macro assess that is similar to assert but does not stop the program when it fails.
Instead, it logs its results in the file that has been given to assopen.
Its first argument is a key used as an identifier of each assessment, to distinguish them within one run.
Note that we save the result of strcmp in a variable because assess is a macro that converts its arguments to strings (using the #arg feature of CPP), which may fail when the expression itself includes a string.
Note also that we have used a literal string and not a variable, which is actually very important: my first tests were using something like
char EXPECT[] = RESULT;
char TEST[] = STR;
to declare the expected result and the string to test.
But I discovered that some badly written programs overflowed string TEST and changed EXPECT as well, in such a way that the comparison of both was always correct even if the algorithm was completely wrong.
So, it is a good idea to use strings that do not reside in the same areas of the process memory.
In general, it turns out to be a surprisingly difficult task to reliably perform functional tests on C programs that may have memory corruption.
prepare.sh
This script is responsible for creating a source file to be compiled and tested.
In general, it replaces the main function provided by students with one that performs the expected tests.
For the example exam, I’ve used two tests: one to check whether the student’s program provides the right result on the example from the exam statement, and another that checks an unexpected string to test the program more thoroughly.
First, we extend mkbad.py so that it saves more information and creates files run.sh together with bad.sh:
import pathlib, string, random
import tqdm
import qnd
from badass.p5 import PrePreProcessor, String

random.seed(12345)
chars = list(string.ascii_letters + string.digits + "!#$%&()*+,-./:;<=>?@[]^_{|}~")
random.shuffle(chars)
STR = String("".join(chars))

for tgtdir in tqdm.tqdm(list(pathlib.Path("done").glob("*"))) :
    project = tgtdir.name
    ppp = PrePreProcessor(f"todo/{project}.c")
    with open(tgtdir / "bad.sh", "w") as out :
        out.write(f"FUN='{ppp.top.cppenv['FUN']}'\n"
                  f"ARG='{ppp.top.cppenv['ARG']}'\n"
                  f"TEST='{ppp.top.cppenv['TEST']}'\n"
                  f"STR='{ppp.top.cppenv['STR']}'\n"
                  f"RESULT='{ppp.top.cppenv['RESULT']}'\n")
    RESULT = String(getattr(qnd, ppp.top.cppenv["algo"])(STR))
    with open(tgtdir / "run.sh", "w") as out :
        out.write(f"STR='{STR}'\n"
                  f"RESULT='{RESULT}'\n")
Intuitively, bad.sh has everything we need to build a valid test.c as in the exam statement, and run.sh can replace two variables to perform the unexpected test.
If we want to run the test from the original exam statement, we’ll replace run.sh with an empty file; otherwise, we leave it as generated.
Script prepare.sh will thus read the student’s source, drop function main, and replace it with the one from test.c.
. bad.sh
. run.sh
badass -o run.c patch main.c del main
badass p5 -l TEST="$TEST" -l STR="$STR" -l FUN="$FUN" -l RESULT="$RESULT" test.c >> run.c
cat run.c
First we load bad.sh and run.sh (the latter possibly replaced with an empty file).
Then, we use badass patch to remove function main from file main.c, and then badass p5 to add our own function main.
Everything is saved to a file run.c that we cat at the end, so we’ll be able to inspect the source code actually used in the test.
Don’t forget that firejail will cancel all these changes to the filesystem, but badass saves the processes’ outputs.
Talking about firejail, you may have an issue executing badass from within it: it may not find libclang (and will complain about ldconfig not being accessible).
To work around this, configure BADASS_LIBCLANG as explained in the installation instructions.
(This shows that firejail indeed restricts access to the system…)
build.sh
Script build.sh is very simple:
gcc run.c
It produces a file a.out that badass run expects in order to execute the program.
Note that it could also produce a file argv with one argument per line to be passed to a.out upon execution.
In general, build.sh could be produced or customised by the students in order to include the libraries they may have needed.
Perform the test
In order to perform the test on known values, we have to overwrite run.sh with an empty file, so we prepare an empty run.sh at the root of our working directory.
Then we can run:
$ badass --base=done/1 run --out --err --ret -r badass.h prepare.sh build.sh test.c run.sh
The options are as follows:
- --base sets the base directory for the assessed project, as previously
- --out tells badass run to print the standard output of the three processes used: the execution of prepare.sh, that of build.sh, and that of the program
- similarly, --err and --ret print the standard error and the return code of these processes
- -r is followed by a list of files to add to the base directory (replacing them if they exist), so in particular this will overwrite done/1/run.sh with its empty version. (We could have used option -a, which adds files but avoids overwriting existing ones; typically build.sh is added with -a when students may customise it.)
Note that you don’t need to have badass.h available among your files: if it is, it will be used; otherwise, the stock version will be used transparently.
Then, badass run starts firejail, copies the files, and runs the three processes, which results in this output:
run done/1
=== prep.ret ===
0
=== prep.out ===
#include <stdio.h>
void met (char* s) {
int i = 0;
while (s[i]) {
if (s[i] >= 'a' && s[i] <= 'z') {
s[i] += 'A' - 'a';
}
i++;
}
}
#include "badass.h"
#include <string.h>
int main (void) {
char these[] = "anxious DRAGONFLY";
assopen("run.ass");
met(these);
int _cmp = strcmp(these, "ANXIOUS DRAGONFLY");
assess(known, _cmp == 0);
}
=== prep.err ===
load main.c
--- main
=== build.ret ===
0
=== run.ret ===
0
Note that build.err, build.out, and others are not printed because they are empty.
Note also the blank lines in run.c that correspond to where the original function main was: when it is deleted, its source is replaced with blank lines so that compilation error messages will be consistent with the original source code.
(Another reason is that this allows deleting or adding several declarations without reparsing the source code after each change.)
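The trick is easy to reproduce on its own: to drop a declaration that spans given lines without shifting the rest of the file, just overwrite those lines with blanks. A sketch of the idea (the 5–12 range is hypothetical; badass finds the actual extent from the clang AST):
def blank_out(lines, start, stop):
    # replace lines start..stop (1-based, inclusive) with empty lines so that
    # later diagnostics still point to the original line numbers
    for i in range(start - 1, stop):
        lines[i] = ""
    return lines

src = open("main.c").read().splitlines()
src = blank_out(src, 5, 12)               # hypothetical extent of main()
open("run.c", "w").write("\n".join(src) + "\n")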
If we run the same command but without including run.sh in the -r option, we’ll have the test with the more complete string:
$ badass --base=done/1 run --out --err --ret -r badass.h prepare.sh build.sh test.c
run done/1
=== prep.ret ===
0
=== prep.out ===
#include <stdio.h>
void met (char* s) {
int i = 0;
while (s[i]) {
if (s[i] >= 'a' && s[i] <= 'z') {
s[i] += 'A' - 'a';
}
i++;
}
}
#include "badass.h"
#include <string.h>
int main (void) {
char these[] = "8:tC=N>WEOm6Xd^Y)B4eqn]ifJr|z.RFh!c5<o#UK2*Pg&;LkQZa%?SG79D}@/~_jAsv$0(lTx+w,H[p{u3-IyVMb1";
assopen("run.ass");
met(these);
int _cmp = strcmp(these, "8:TC=N>WEOM6XD^Y)B4EQN]IFJR|Z.RFH!C5<O#UK2*PG&;LKQZA%?SG79D}@/~_JASV$0(LTX+W,H[P{U3-IYVMB1");
assess(ascii, _cmp == 0);
}
=== prep.err ===
load main.c
--- main
=== build.ret ===
0
=== run.ret ===
0
In order to automate and record these two tests, we just need to add these two commands at the end of our main bad.sh
(I’ve removed --out
and such options here):
badass --record="$PROJ:known" --base="$BASE" run -r badass.h prepare.sh build.sh test.c run.c
badass --record="$PROJ:ascii" --base="$BASE" run -r badass.h prepare.sh build.sh test.c
Performing all the tests now just consists in running bad.sh in a loop.
Personally, I like to add it at the end of mkbad.py:
import pathlib, string, random, subprocess
import qnd
from badass.p5 import PrePreProcessor, String

random.seed(12345)
chars = list(string.ascii_letters + string.digits + "!#$%&()*+,-./:;<=>?@[]^_{|}~")
random.shuffle(chars)
STR = String("".join(chars))

for tgtdir in sorted(pathlib.Path("done").glob("*")) :
    project = tgtdir.name
    ppp = PrePreProcessor(f"todo/{project}.c")
    with open(tgtdir / "bad.sh", "w") as out :
        out.write(f"FUN='{ppp.top.cppenv['FUN']}'\n"
                  f"ARG='{ppp.top.cppenv['ARG']}'\n"
                  f"TEST='{ppp.top.cppenv['TEST']}'\n"
                  f"STR='{ppp.top.cppenv['STR']}'\n"
                  f"RESULT='{ppp.top.cppenv['RESULT']}'\n")
    RESULT = String(getattr(qnd, ppp.top.cppenv["algo"])(STR))
    with open(tgtdir / "run.sh", "w") as out :
        out.write(f"STR='{STR}'\n"
                  f"RESULT='{RESULT}'\n")
    subprocess.check_call(["sh", "bad.sh", str(tgtdir)])
In this case, I suppress the tqdm progress bar and sort the iteration, so that the output from badass run gives enough information to follow the progress.
One thing could go wrong at this point: a program may enter a deadlock or a livelock.
There are two cases: either it’s a livelock that consumes CPU, and then firejail will kill the process after 60 seconds of CPU time; or the process does not consume CPU (for instance it sleeps or waits for input), in which case you’ll have to kill it by hand.
Future versions may include a watchdog on wall-clock time, but that’s not the case yet.
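In the meantime, if you drive the runs from your own script, a crude wall-clock watchdog is easy to add around each invocation; this is a sketch of mine, not a badass feature, and the 120-second limit is arbitrary:
import subprocess

def assess_with_timeout(tgtdir, timeout=120):
    # kill bad.sh after `timeout` seconds of wall-clock time; a process started
    # deeper in the tree (e.g. inside firejail) may still survive and need a
    # manual kill
    try:
        subprocess.run(["sh", "bad.sh", str(tgtdir)], timeout=timeout)
    except subprocess.TimeoutExpired:
        print(f"{tgtdir}: timed out after {timeout}s, needs manual inspection")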
The whole run will take a while because badass is quite slow at startup and we invoke it many times, but well, this is time you don’t have to spend yourself assessing the projects.
Collect results and prepare final assessment
To produce the final assessment, we start by running badass report, which produces a file badass.csv with all our tests in the columns and all the projects in the rows:
With this file open in a spreadsheet, you can add columns and compute a mark depending on your criteria. Mine are usually: give the maximal mark if all dynamic tests (compilation and functional) pass; otherwise, take the sum of the results from the dynamic tests plus 1, and multiply it by a mark computed from the static tests, in such a way that the result stays below a maximum. This gives a boost to programs that compile and pass some dynamic tests, programs that just compile are not marked too badly, and programs that do not even compile get a bad mark. However, I usually check manually the programs that compile but pass only the known functional test; for instance, this may come from functions that fill their argument with the static string expected by the exam statement.
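As an illustration, here is one possible reading of that rule as a script rather than spreadsheet formulas. The column names are the ones recorded with --record above (I ignore a separate compilation column for simplicity), and the project column name, the weights, and the 0–5 scale are only assumptions:
import csv

STATIC = ["good_decl", "bad_decl", "for_if", "while_if", "do_if",
          "for_switch", "while_switch", "do_switch"]
DYNAMIC = ["known", "ascii"]   # the two functional tests recorded earlier
MAX_MARK = 5                   # arbitrary scale for this example

with open("badass.csv") as f:
    for row in csv.DictReader(f):
        dyn = sum(int(row[c] or 0) for c in DYNAMIC)
        sta = sum(int(row[c] or 0) for c in STATIC) / len(STATIC)
        if dyn == len(DYNAMIC):
            mark = MAX_MARK                            # all dynamic tests pass
        else:
            mark = min(MAX_MARK - 1, (dyn + 1) * sta)  # partial credit, capped
        print(row["project"], round(mark, 1))          # "project" column name assumed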
Whatever you do with these marks, they let you see which programs actually compile, pass the functional tests, or at least exhibit some of the required elements.
And these are the programs you may want to assess manually; the others are probably not worth looking at.
For instance, many colleagues of mine won’t even try to assess programs that do not compile.
badass allows you to be less rigid, as it can search for the structures expected from the algorithm the students were asked to implement.