Corpus classes

Much of corpkit’s functionality comes from the ability to work with Corpus and Corpus-like objects, which have methods for parsing, tokenising, interrogating and concordancing.


class corpkit.corpus.Corpus(path, **kwargs)[source]

Bases: object

A class representing a linguistic text corpus, which contains files, optionally within subcorpus folders.

Methods for concordancing, interrogating, getting general stats, getting behaviour of particular word, etc.

Unparsed, tokenised and parsed corpora use the same class, though some methods are available only to one or the other. Only unparsed corpora can be parsed, and only parsed/tokenised corpora can be interrogated.


subcorpora

A list-like object containing a corpus’ subcorpora.

>>> corpus.subcorpora
<corpkit.corpus.Datalist instance: 12 items>

Lazy-loaded data.


files

A list-like object containing the files in a folder.

>>> corpus.subcorpora[0].files
<corpkit.corpus.Datalist instance: 240 items>

Lazy-loaded data.


Remove the sentence index column from old corpkit data.


Lazy-loaded data.

tfidf(search={'w': 'any'}, show=['w'], **kwargs)[source]

Generate a TF-IDF vector representation of the corpus using the interrogate() method. All args and kwargs are passed to interrogate().

Returns: a tuple of the vectoriser and the matrix
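The arithmetic behind a TF-IDF matrix is easy to sketch. The function below is a toy stand-in, not corpkit's implementation: it builds a document-term matrix of tf * idf weights directly from raw strings, where corpkit would instead build its counts via interrogate().

```python
import math
from collections import Counter

def tfidf_sketch(docs):
    """Toy TF-IDF: one row per document, one column per vocabulary term."""
    tokenised = [doc.lower().split() for doc in docs]
    vocab = sorted({w for doc in tokenised for w in doc})
    n_docs = len(tokenised)
    # document frequency: how many documents contain each term?
    df = {t: sum(1 for doc in tokenised if t in doc) for t in vocab}
    matrix = []
    for doc in tokenised:
        counts = Counter(doc)
        # term frequency scaled by inverse document frequency
        row = [(counts[t] / len(doc)) * math.log(n_docs / df[t]) for t in vocab]
        matrix.append(row)
    return vocab, matrix

vocab, matrix = tfidf_sketch(['the cat sat', 'the dog sat', 'a cat barked'])
```

Terms that occur in every document receive a weight of zero, which is why stopwords tend to vanish from TF-IDF representations.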

features

Generate and show basic stats from the corpus, including the number of sentences, clauses, process types, etc.

>>> corpus.features
    ..  Characters  Tokens  Words  Closed class words  Open class words  Clauses
    01       26873    8513   7308                4809              3704     2212   
    02       25844    7933   6920                4313              3620     2270   
    03       18376    5683   4877                3067              2616     1640   
    04       20066    6354   5366                3587              2767     1775

Lazy-loaded data.



configurations(search, **kwargs)[source]

Get the overall behaviour of tokens or lemmas matching a regular expression. The search below makes DataFrames containing the most common subjects, objects, modifiers (etc.) of ‘see’:

Parameters:search (dict) –

Similar to search in the interrogate() method.

Valid keys are:

  • W/L: match word or lemma
  • F: match a semantic role (‘participant’, ‘process’ or ‘modifier’). If F is not specified, each role is searched for.
>>> see = corpus.configurations({L: 'see', F: 'process'}, show=L)
>>> see.has_subject.results.sum()
    i           452
    it          227
    you         162
    we          111
    he           94
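The role filtering that configurations() performs can be sketched with plain regular expressions. The annotated tokens list and the match_config helper below are hypothetical, purely to illustrate how L and F criteria combine and how an omitted F accepts every role:

```python
import re

# Hypothetical annotated tokens: (word, lemma, semantic role)
tokens = [
    ('saw', 'see', 'process'),
    ('sees', 'see', 'process'),
    ('seer', 'seer', 'participant'),
    ('clearly', 'clearly', 'modifier'),
]

def match_config(tokens, search):
    """Keep words whose lemma (L) and role (F) match the given regexes.
    If F is absent, every role is accepted."""
    out = []
    for word, lemma, role in tokens:
        if 'L' in search and not re.search(search['L'], lemma):
            continue
        if 'F' in search and not re.search(search['F'], role):
            continue
        out.append(word)
    return out

processes = match_config(tokens, {'L': 'see', 'F': 'process'})
```

Note that the unanchored pattern 'see' also matches the lemma 'seer'; anchoring ('^see$') restricts matches to the exact lemma.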
interrogate(search='w', *args, **kwargs)[source]

Interrogate a corpus of texts for a lexicogrammatical phenomenon.

This method iterates over the files/folders in a corpus, searching the texts, and returning a corpkit.interrogation.Interrogation object containing the results. The main options are search, where you specify search criteria, and show, where you specify what you want to appear in the output.

>>> corpus = Corpus('data/conversations-parsed')
### show lemma form of nouns ending in 'ing'
>>> q = {W: r'ing$', P: r'^N'}
>>> data = corpus.interrogate(q, show=L)
>>> data.results
    ..  something  anything  thing  feeling  everything  nothing  morning
    01         14        11     12        1           6        0        1
    02         10        20      4        4           8        3        0
    03         14         5      5        3           1        0        0
    ...                                                               ...
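The search-and-show logic of the example above can be sketched in a few lines of plain Python. Everything here (the token tuples and the interrogate_sketch helper) is hypothetical illustration, not corpkit internals: each search key must match (all-mode), and show selects which attribute of the matches gets tallied.

```python
import re
from collections import Counter

# Hypothetical (word, lemma, pos) tuples for one subcorpus
tokens = [
    ('something', 'something', 'NN'),
    ('running', 'run', 'VBG'),
    ('feeling', 'feeling', 'NN'),
    ('sing', 'sing', 'VB'),
    ('thing', 'thing', 'NN'),
]

def interrogate_sketch(tokens, search, show='lemma'):
    """Keep tokens matching every search regex, then tally the shown attribute."""
    idx = {'word': 0, 'lemma': 1, 'pos': 2}
    hits = [t for t in tokens
            if all(re.search(patt, t[idx[key]]) for key, patt in search.items())]
    return Counter(t[idx[show]] for t in hits)

# lemma form of nouns ending in 'ing', as in the example above
results = interrogate_sketch(tokens, {'word': r'ing$', 'pos': r'^N'})
```

'running' is excluded because its POS tag fails the '^N' criterion, even though its word form matches 'ing$'.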
Parameters:search (dict) –

What part of the lexicogrammar to search, and what criteria to match. The keys are the thing to be searched, and values are the criteria. To search parse trees, use the T key, and a Tregex query as the value. When searching dependencies, you can use any of:

                        Match  Governor  Dependent  Head
    Word                W      G         D          H
    Lemma               L      GL        DL         HL
    Function            F      GF        DF         HF
    Word class          X      GX        DX         HX
    Distance from root  A      GA        DA         HA
    Index               I      GI        DI         HI
    Sentence index      S      SI        SI         SI

Values should be regular expressions or wordlists to match.

>>> corpus.interrogate({T: r'/NN.?/ < /^t/'}) # t-initial nouns, via trees
>>> corpus.interrogate({W: '^t', P: r'^v'}) # t-initial verbs, via dependencies
  • searchmode (str, ‘any’/‘all’) – Return results matching any/all criteria
  • exclude (dict, e.g. {L: ‘be’}) – The inverse of search, removing results from the search
  • excludemode (str, ‘any’/‘all’) – Exclude results matching any/all criteria
  • query (str, dict or list) –

    A search query for the interrogation. This is only used when search is a str, or when multiprocessing. If search is a str, the search criteria can be passed in as query, in order to allow the simpler syntax:

    >>> corpus.interrogate(GL, '(think|want|feel)')

    When multiprocessing, the following is possible:

    >>> q = {'Nouns': r'/NN.?/', 'Verbs': r'/VB.?/'}
    ### return an :class:`corpkit.interrogation.Interrogation` object with multiindex:
    >>> corpus.interrogate(T, q)
    ### return an :class:`corpkit.interrogation.Interrogation` object without multiindex:
    >>> corpus.interrogate(T, q, show=C)
  • show (str/list of strings) –

    What to output. If multiple strings are passed in as a list, results will be colon-separated, in the supplied order. Possible values are the same as those for search, plus options for n-grams and collocates:

    Show  Gloss                 Example
    N     N-gram word           The women were
    NL    N-gram lemma          The woman be
    NF    N-gram function       det nsubj root
    NP    N-gram POS tag        DT NNS VBN
    NX    N-gram word class     determiner noun verb
    B     Collocate word        The_were
    BL    Collocate lemma       The_be
    BF    Collocate function    det_root
    BP    Collocate POS tag     DT_VBN
    BX    Collocate word class  determiner_verb
  • lemmatise (bool) – Force lemmatisation on results. Deprecated: instead, output a lemma form with the `show` argument
  • lemmatag (‘n’/‘v’/‘a’/‘r’/False) – When using a Tregex/Tgrep query, the tool will attempt to determine the word class of results from the query. Passing in a str here tells the lemmatiser the expected POS of results to lemmatise. It only has an effect if trees are being searched and lemmata are being shown.
  • save (str) – Save result as pickle to saved_interrogations/<save> on completion
  • gramsize (int) – Size of n-grams (default 1, i.e. unigrams)
  • multiprocess (int/bool) – How many parallel processes to run (True determines the number automatically)
  • files_as_subcorpora (bool) – (Deprecated, use subcorpora=files). Treat each file as a subcorpus, ignoring actual subcorpora if present
  • conc (bool/‘only’) – Generate a concordance while interrogating, store as .concordance attribute
  • coref (bool) – Also get coreferents for search matches
  • tgrep (bool) – Use TGrep for tree querying. TGrep is less expressive than Tregex, and is slower, but can work without Java. This option may be turned on internally if Java is not found.
  • subcorpora (str/list) – Use a metadata value as subcorpora. Passing a list will create a multiindex. ‘file’ and ‘folder’/‘default’ are also possible values.
  • just_metadata (dict) – One or more metadata fields and criteria to filter sentences by. Only those matching will be kept. Criteria can be a list of words or a regular expression. Passing {'speaker': 'ENVER'} will search only sentences annotated with speaker=ENVER.
  • skip_metadata (dict) – A field and regex/list to filter sentences by. The inverse of just_metadata.
  • discard (int/float) – When returning many (i.e. millions) of results, memory can be a problem. Setting a discard value will ignore results occurring infrequently in a subcorpus. An int will remove any result occurring n times or fewer. A float will remove this proportion of results (i.e. 0.1 will remove 10 per cent)

Returns: A corpkit.interrogation.Interrogation object, with .query, .results and .totals attributes. If multiprocessing is invoked, the result may be multiindexed.
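The discard option above drops low-frequency results before they are stored. A minimal sketch of the two modes, where the float semantics (dropping that proportion of distinct results, least frequent first) are an assumption about "remove this proportion of results":

```python
from collections import Counter

def discard_sketch(counts, discard):
    """int: drop results occurring discard times or fewer.
    float: drop that proportion of distinct results, least frequent first
    (an assumed interpretation, not corpkit's code)."""
    if isinstance(discard, float):
        n_drop = int(len(counts) * discard)
        kept = counts.most_common(len(counts) - n_drop)
        return Counter(dict(kept))
    return Counter({k: v for k, v in counts.items() if v > discard})

counts = Counter({'the': 500, 'cat': 40, 'sat': 3, 'mat': 1})
```

With counts like these, discard=3 and discard=0.5 happen to keep the same two results, but by different criteria: a frequency floor versus a proportion of the result set.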

sample(n, level='f')[source]

Get a sample of the corpus

  • n (int/float) – Amount of data in the sample. If an int, get n files; if a float, get that fraction of the corpus (e.g. 0.25 gets 25 per cent)
  • level (str) – sample subcorpora (s) or files (f)

Returns: a Corpus object
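The int/float distinction for n can be sketched with the standard library's random module. This is hypothetical illustration, not corpkit's sampling code:

```python
import random

def sample_sketch(files, n, seed=None):
    """int n: pick n files at random; float n: pick that fraction of them."""
    rng = random.Random(seed)  # seeded for reproducibility
    if isinstance(n, float):
        n = int(len(files) * n)
    return rng.sample(files, n)

files = ['%02d.txt' % i for i in range(10)]
```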


Delete metadata for the corpus. May be needed if the corpus has changed.


Lazy-loaded data.

parse(corenlppath=False, operations=False, copula_head=True, speaker_segmentation=False, memory_mb=False, multiprocess=False, split_texts=400, outname=False, metadata=False, coref=True, *args, **kwargs)[source]

Parse an unparsed corpus, saving to disk

  • corenlppath (str) – Folder containing corenlp jar files (use if corpkit can’t find it automatically)
  • operations (str) – Which kinds of annotations to do
  • speaker_segmentation (bool) – Add speaker name to parser output if your corpus is script-like
  • memory_mb (int) – Amount of memory in MB for parser
  • copula_head (bool) – Make copula head in dependency parse
  • split_texts (int) – Split texts longer than n lines for parser memory
  • multiprocess (int) – Split parsing across n cores (for high-performance computers)
  • folderise (bool) – If corpus is just files, move each into own folder
  • output_format (str) – Save parser output as xml, json, conll
  • outname (str) – Specify a name for the parsed corpus
  • metadata (bool) – Use if you have XML tags at the end of lines containing metadata
>>> parsed = corpus.parse(speaker_segmentation=True)
>>> parsed
<corpkit.corpus.Corpus instance: speeches-parsed; 9 subcorpora>
Returns: The newly created corpkit.corpus.Corpus
tokenise(postag=True, lemmatise=True, *args, **kwargs)[source]

Tokenise a plaintext corpus, saving to disk

Parameters:nltk_data_path (str) – Path to tokeniser if not found automatically
>>> tok = corpus.tokenise()
>>> tok
<corpkit.corpus.Corpus instance: speeches-tokenised; 9 subcorpora>
Returns: The newly created corpkit.corpus.Corpus
concordance(*args, **kwargs)[source]

A concordance method for Tregex queries, CoreNLP dependencies, tokenised data or plaintext.

>>> wv = ['want', 'need', 'feel', 'desire']
>>> corpus.concordance({L: wv, F: 'root'})
   0   01  1-01.txt.conll                But , so I  feel     like i do that for w
   1   01  1-01.txt.conll                         I  felt     a little like oh , i
   2   01  1-01.txt.conll   he 's a difficult man I  feel     like his work ethic
   3   01  1-01.txt.conll                      So I  felt     like i recognized li
   ...                                                                       ...

Arguments are the same as interrogate(), plus a few extra parameters:

  • only_format_match (bool) – If True, left and right window will just be words, regardless of what is in show
  • only_unique (bool) – Return only unique lines
  • maxconc (int) – Maximum number of concordance lines

A corpkit.interrogation.Concordance instance, with columns showing filename, subcorpus name, speaker name, left context, match and right context.
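The windowing behind lines like those above is simple to sketch. The helper below is a toy version, not corpkit's implementation: it finds matching tokens and returns left context, match and right context:

```python
import re

def concordance_sketch(tokens, pattern, window=4):
    """Return (left, match, right) strings for each token matching pattern."""
    lines = []
    for i, tok in enumerate(tokens):
        if re.search(pattern, tok):
            left = ' '.join(tokens[max(0, i - window):i])
            right = ' '.join(tokens[i + 1:i + 1 + window])
            lines.append((left, tok, right))
    return lines

sent = 'But so I feel like I do that for work'.split()
lines = concordance_sketch(sent, r'^feel$')
```

Real concordancers align the match column across lines; here only the raw windows are produced.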

interroplot(search, **kwargs)[source]

Interrogate, relativise, then plot, with very little customisability. A demo function.

>>> corpus.interroplot(r'/NN.?/ >># NP')
<matplotlib figure>
  • search (dict) – Search as per interrogate()
  • kwargs (keyword arguments) – Extra arguments to pass to visualise()

Returns: None (but shows a plot)

save(savename=False, **kwargs)[source]

Save corpus instance to file. There’s not much reason to do this, really.

Parameters:savename (str) – Name for the file
make_language_model(name, search={'w': 'any'}, exclude=False, show=['w', '+1mw'], **kwargs)[source]

Make a language model for the corpus

  • name (str) – a name for the model
  • kwargs (keyword arguments) – keyword arguments for the interrogate() method

Returns: a corpkit.model.MultiModel
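The default show=['w', '+1mw'] pairs each word with the next word, which is essentially bigram counting. A minimal sketch of that counting step (hypothetical code, not corpkit's MultiModel):

```python
from collections import Counter, defaultdict

def bigram_counts(sentences):
    """Count word -> next-word transitions from whitespace-tokenised text."""
    model = defaultdict(Counter)
    for sent in sentences:
        toks = sent.lower().split()
        # pair each token with its successor
        for w, nxt in zip(toks, toks[1:]):
            model[w][nxt] += 1
    return model

model = bigram_counts(['the cat sat', 'the cat ran', 'a dog sat'])
```

A real language model would normalise these counts into probabilities and apply smoothing for unseen transitions.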

annotate(conclines, annotation, dry_run=True)[source]

Annotate a corpus

  • conclines – a Concordance or DataFrame containing matches to annotate
  • annotation (str/dict) – a tag or field and value
  • dry_run (bool) – Show the annotations to be made, but don’t do them


unannotate(annotation, dry_run=True)[source]

Delete annotation from a corpus

Parameters:annotation (str/dict) – a tag or field and value


class corpkit.corpus.Corpora(data=False, **kwargs)[source]

Bases: corpkit.corpus.Datalist

Models a collection of Corpus objects. Methods are available for interrogating and plotting the entire collection. This is the highest level of abstraction available.

Parameters:data (str/list) – Corpora to model. A str is interpreted as a path containing corpora. A list can be a list of corpus paths or corpkit.corpus.Corpus objects.

Parse multiple corpora

Parameters:kwargs – Arguments to pass to the parse() method.

features

Generate and show basic stats from the corpus, including the number of sentences, clauses, process types, etc.

>>> corpus.features
    ..  Characters  Tokens  Words  Closed class words  Open class words  Clauses
    01       26873    8513   7308                4809              3704     2212   
    02       25844    7933   6920                4313              3620     2270   
    03       18376    5683   4877                3067              2616     1640   
    04       20066    6354   5366                3587              2767     1775

Lazy-loaded data.




class corpkit.corpus.Subcorpus(path, datatype, **kwa)[source]

Bases: corpkit.corpus.Corpus

Model a subcorpus, containing files but no subdirectories.

Methods for interrogating, concordancing and configurations are the same as corpkit.corpus.Corpus.


class corpkit.corpus.File(path, dirname=False, datatype=False, **kwa)[source]

Bases: corpkit.corpus.Corpus

Models a corpus file for reading, interrogating, concordancing.

Methods for interrogating, concordancing and configurations are the same as corpkit.corpus.Corpus, plus methods for accessing the file contents directly as a str, or as a Pandas DataFrame.


Read file data. If data is pickled, unpickle first

Returns: str/unpickled data

Return a DataFrame representation of a parsed file


Lazy-loaded data.




class corpkit.corpus.Datalist(data, **kwargs)[source]

Bases: list

interrogate(*args, **kwargs)[source]

Interrogate the corpus using interrogate()

concordance(*args, **kwargs)[source]

Concordance the corpus using concordance()

configurations(search, **kwargs)[source]

Get a configuration using configurations()