Overview

corpkit comes with a dedicated interpreter, which receives commands in a natural language syntax like these:

> set mydata as corpus
> search corpus for pos matching 'JJ.*'
> call result 'adjectives'
> edit adjectives by skipping subcorpora matching 'books'
> plot edited as line chart with title as 'Adjectives'

It’s a little less powerful than the full Python API, but it is easier to use, especially if you don’t know Python. You can also switch instantly from the interpreter to the full API, so you only need the API for the really tricky stuff.

The syntax of the interpreter is based around objects, which you do things to, and commands, which are actions performed upon the objects. The example below uses the search command on a corpus object, which produces new objects, called result, concordance, totals and query. As you can see, very complex searches can be performed using an English-like syntax:

> search corpus for lemma matching '^t' and pos matching 'VB' \
...     excluding words matching 'try' \
...     showing word and dependent-word \
...     with preserve_case
> result

This shows us results for each subcorpus:

.         I/think  I/thought  and/turned  me/told  and/took  I/told   ...
chapter1        5          3           2        2         1       3   ...
chapter2        7          2           5        3         0       2   ...
chapter3        5          5           4        4         1       0   ...
chapter4        3          7           1        0         3       1   ...
chapter5        7          7           2        1         4       2   ...
chapter6        2          0           0        2         1       0   ...
chapter7        6          2           6        1         1       3   ...
chapter8        3          1           2        2         1       1   ...
chapter9        5          7           1        4         6       3   ...

Objects

The most common objects you’ll be using are:

Object Contains
corpus Dataset selected for parsing or searching
result Search output
edited Results after sorting, editing or calculating
concordance Concordance lines from search
features General linguistic features of corpus
wordclasses Distribution of word classes in corpus
postags Distribution of POS tags in corpus
lexicon Distribution of lexis in the corpus
figure Plotted data
query Values used to perform search or edit
previous Object created before last
sampled A sampled corpus
wordlists A list of wordlists for searching, editing

When you start the interpreter, these are all empty. You’ll need to run commands in order to fill them with data. You can also create your own object names using the call command.

Commands

You do things to the objects via commands. Each command has its own syntax, designed to be as similar to natural language as possible. Below is a table of common commands, an explanation of their purpose, and an example of their syntax

Command Purpose Syntax
new Make a new project new project <name>
set Set current corpus set <corpusname>
parse Parse corpus parse corpus with [options]*
search Search a corpus for linguistic feature, generate concordance search corpus for [feature matching pattern]* showing [feature]* with [options]*
edit Edit results or edited results edit result by [skipping subcorpora/entries matching pattern]* with [options]*
calculate Calculate relative frequencies, keyness, etc. calculate result/edited as operation of denominator
sort Sort results or concordance sort result/concordance by value
plot Visualise result or edited result plot result/edited as line chart with [options]*
show Show any object show object
annotate Add annotations to corpus based on search results annotate all with field as <fieldname> and value as m
unannotate Delete annotation fields from corpus unannotate <fieldname> field
sample Get a random sample of subcorpora or files from a corpus sample 5 subcorpora of corpus
call Name an object (i.e. make a variable) call object ‘<name>’
export Export result, edited result or concordance to string/file export result to string/csv/latex/file <filename>
save Save data to disk save object to <filename>
load Load data from disk load object as result
store Store something in memory store object as <name>
fetch Fetch something from memory fetch <name> as object
help Get help on an object or command help command/object
history See previously entered commands history
ipython Enter IPython with objects available ipython
py Execute Python code py ‘print(“hello world”)’
! Run a line of bash shell !ls -al data

In square brackets with asterisks are recursive parts of the syntax, which often also accept not operators. <text> denotes places where you can choose an identifier, filename, etc.

In the pages that follow, the syntax is provided for the most common commands. You can also type the name of the command with no arguments into the interpreter, in order to show usage examples.

Prompt features

  • You can use history, clear, ls and cd commands as you would in the shell
  • You can execute arbitrary bash commands by beginning the line with an exclamation point (e.g. !rm data/*)
  • You can use semicolons to put multiple commands on a line (currently needs a space before and after the semicolon)
  • There is no piping or output redirection (yet), but you can use the export and save commands to export results
  • You can use backslashes to continue writing on the next line
  • You can write scripts and pass them to the corpkit interpreter

The below is therefore a possible (but terrible) way to write code in corpkit:

> !du -h data ; set mycorp ; search corpus for words \
... matching any \
... excluding wordlists.closedclass \
... showing lemma and pos ; concordance