Corpus classes

Much of corpkit's functionality comes from its ability to work with Corpus and Corpus-like objects, which have methods for parsing, tokenising, interrogating and concordancing.

Corpus

class corpkit.corpus.Corpus(path, **kwargs)[source]

Bases: object

A class representing a linguistic text corpus, which contains files, optionally within subcorpus folders.

Methods for concordancing, interrogating, getting general stats, getting the behaviour of particular words, etc.

document

Return the parsed XML of a parsed file

read(**kwargs)[source]

Read file data. If the data is pickled, it is unpickled first.

Returns: str or unpickled data
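The pickle fallback described above can be pictured as follows. This is an illustrative sketch only, not corpkit's actual implementation; the helper name read_data is invented for the example:

```python
import pickle

def read_data(path):
    """Return unpickled data if the file is a pickle, else its text.

    Toy stand-in: corpkit's real read() lives on the Corpus/File
    classes and may differ in detail.
    """
    with open(path, 'rb') as f:
        raw = f.read()
    try:
        # Pickled files unpickle cleanly; plain text raises an error
        return pickle.loads(raw)
    except Exception:
        return raw.decode('utf-8')
```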
subcorpora

A list-like object containing a corpus’ subcorpora.

Example:
>>> corpus.subcorpora
<corpkit.corpus.Datalist instance: 12 items>
speakerlist

Lazy-loaded data.

files

A list-like object containing the files in a folder.

Example:
>>> corpus.subcorpora[0].files
<corpkit.corpus.Datalist instance: 240 items>
features

Generate and show basic stats from the corpus, including number of sentences, clauses, process types, etc.

Example:
>>> corpus.features
    ..  Characters  Tokens  Words  Closed class words  Open class words  Clauses
    01       26873    8513   7308                4809              3704     2212   
    02       25844    7933   6920                4313              3620     2270   
    03       18376    5683   4877                3067              2616     1640   
    04       20066    6354   5366                3587              2767     1775
wordclasses

Lazy-loaded data.

postags

Lazy-loaded data.

configurations(search, **kwargs)[source]

Get the overall behaviour of tokens or lemmas matching a regular expression. The search below makes DataFrames containing the most common subjects, objects, modifiers (etc.) of ‘see’:

Parameters: search (dict) – Similar to search in the interrogate() and concordance() methods. W/L keys match word or lemma; the F key specifies semantic role ('participant', 'process' or 'modifier'). If F is not specified, each role will be searched for.

Example:
>>> see = corpus.configurations({L: 'see', F: 'process'}, show = L)
>>> see.has_subject.results.sum()
    i           452
    it          227
    you         162
    we          111
    he           94
Returns: corpkit.interrogation.Interrodict
interrogate(search, *args, **kwargs)[source]

Interrogate a corpus of texts for a lexicogrammatical phenomenon.

This method iterates over the files/folders in a corpus, searching the texts, and returning a corpkit.interrogation.Interrogation object containing the results. The main options are search, where you specify search criteria, and show, where you specify what you want to appear in the output.

Example:
>>> corpus = Corpus('data/conversations-parsed')
### show lemma form of nouns ending in 'ing'
>>> q = {W: r'ing$', P: r'^N'}
>>> data = corpus.interrogate(q, show = L)
>>> data.results
    ..  something  anything  thing  feeling  everything  nothing  morning
    01         14        11     12        1           6        0        1
    02         10        20      4        4           8        3        0
    03         14         5      5        3           1        0        0
    ...                                                               ...
Parameters: search (str or dict; a dict is used when you have multiple criteria) – What the query should match:

  • t: tree
  • w: word
  • l: lemma
  • p: pos
  • f: function
  • g/gw: governor
  • gl: governor’s lemma form
  • gp: governor’s pos
  • gf: governor’s function
  • d: dependent
  • dl: dependent’s lemma form
  • dp: dependent’s pos
  • df: dependent’s function
  • i: index
  • n: ngrams (deprecated; use show)
  • s: general stats

Keys are what to search for, as strings; values are the criteria: a Tregex query, a regex, or a list of words to match. The two syntaxes below therefore do the same thing:

Example:
>>> corpus.interrogate(T, r'/NN.?/')
>>> corpus.interrogate({T: r'/NN.?/'})
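The equivalence between the two call styles can be pictured as a simple normalisation step. This is a hypothetical sketch, not corpkit's code; normalise_search is an invented name, and T is assumed here to be the constant 't':

```python
T = 't'  # assumed constant for Tregex tree searches

def normalise_search(search, query=None):
    """Normalise the two equivalent call styles into one dict form.

    Hypothetical illustration: a single key with a separate query
    string, or a dict mapping keys to queries, end up identical.
    """
    if isinstance(search, dict):
        return search          # already {key: criteria}
    return {search: query}     # single key + separate query string
```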
Parameters:
  • searchmode (str: 'any'/'all') – Return results matching any/all criteria
  • exclude (dict, e.g. {L: 'be'}) – The inverse of search, removing results from the output
  • excludemode (str: 'any'/'all') – Exclude results matching any/all criteria
  • query – A search query for the interrogation. This is only used when search is a string, or when multiprocessing. If search is a dict, the query/queries are stored there as the values instead. When multiprocessing, the following is possible:

Example:
>>> {'Nouns': r'/NN.?/', 'Verbs': r'/VB.?/'}
### return a corpkit.interrogation.Interrodict object:
>>> corpus.interrogate(T, q)
### return a corpkit.interrogation.Interrogation object:
>>> corpus.interrogate(T, q, show = C)
Parameters: show – What to output. If multiple strings are passed in as a list, results will be colon-separated, in the supplied order. If you want to show ngrams, you can’t have multiple values. Possible values are the same as those for search, plus:

  • a: distance from root
  • n: ngram
  • nl: ngram lemma
  • np: ngram POS
  • npl: ngram wordclass
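The colon-separation of multiple show values can be illustrated with a toy token. This sketch invents a token dict and a format_token helper; it does not reflect corpkit internals:

```python
def format_token(token, show):
    """Join the requested attributes of one token with colons.

    'token' is a toy dict of attribute -> value; 'show' is a list of
    attribute names in the order they should appear in the output.
    """
    return ':'.join(token[attr] for attr in show)

token = {'w': 'thing', 'l': 'thing', 'p': 'NN'}
print(format_token(token, ['w', 'p']))  # thing:NN
```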
Parameters: lemmatise (bool) – Force lemmatisation on results. Deprecated: instead, output a lemma form with the show argument.

Parameters: lemmatag (False/'n'/'v'/'a'/'r') – Explicitly pass a POS to the lemmatiser (generally when data is unparsed), or when the tag cannot be recovered from the Tregex query.

Parameters:
  • spelling (False/'US'/'UK') – Convert all results to U.S. or U.K. English
  • dep_type (str) – The kind of Stanford CoreNLP dependency parses you want to use: 'basic-dependencies'/'a', 'collapsed-dependencies'/'b', or 'collapsed-ccprocessed-dependencies'/'c'

Parameters:
  • save (str) – Save the result as a pickle to saved_interrogations/<save> on completion
  • gramsize (int) – Size of n-grams (default 2)
  • split_contractions (bool) – Make “don’t” and similar into two tokens
  • multiprocess (int/bool) – How many parallel processes to run (True to determine automatically)
  • files_as_subcorpora (bool) – Treat each file as a subcorpus
  • do_concordancing (bool/'only') – Concordance while interrogating, storing results as the .concordance attribute
  • maxconc (int) – Maximum number of concordance lines
Returns: A corpkit.interrogation.Interrogation object, with .query, .results and .totals attributes. If multiprocessing is invoked, the result may be a corpkit.interrogation.Interrodict containing corpus names, queries or speakers as keys.
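The shape of the returned object can be pictured with a toy container. This is a hypothetical sketch: the invented names ToyInterrogation and make_result stand in for the real class, whose .results and .totals are pandas objects and differ in detail:

```python
from collections import namedtuple

# Toy stand-in for corpkit.interrogation.Interrogation:
# .query records what was asked, .results holds per-subcorpus counts,
# .totals holds per-subcorpus sums (as used for relativisation).
ToyInterrogation = namedtuple('ToyInterrogation', 'query results totals')

def make_result(query, results):
    """Build the toy result, deriving totals from per-subcorpus counts."""
    totals = {sub: sum(counts.values()) for sub, counts in results.items()}
    return ToyInterrogation(query, results, totals)
```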

parse(corenlppath=False, operations=False, copula_head=True, speaker_segmentation=False, memory_mb=False, multiprocess=False, split_texts=400, *args, **kwargs)[source]

Parse an unparsed corpus, saving to disk

Parameters: corenlppath (str) – Folder containing the CoreNLP jar files (use this if corpkit cannot find them automatically)

Parameters:
  • operations (str) – Which kinds of annotations to perform
  • speaker_segmentation (bool) – Add speaker names to parser output if your corpus is script-like

Parameters:
  • memory_mb (int) – Amount of memory in MB for parser
  • copula_head (bool) – Make copula head in dependency parse
  • multiprocess (int) – Split parsing across n cores (for high-performance computers)
Example:
>>> parsed = corpus.parse(speaker_segmentation = True)
>>> parsed
<corpkit.corpus.Corpus instance: speeches-parsed; 9 subcorpora>
Returns: The newly created corpkit.corpus.Corpus
tokenise(*args, **kwargs)[source]

Tokenise a plaintext corpus, saving to disk

Parameters: nltk_data_path (str) – Path to the tokeniser, if it is not found automatically
Example:
>>> tok = corpus.tokenise()
>>> tok
<corpkit.corpus.Corpus instance: speeches-tokenised; 9 subcorpora>
Returns: The newly created corpkit.corpus.Corpus
concordance(*args, **kwargs)[source]

A concordance method for Tregex queries, CoreNLP dependencies, tokenised data or plaintext.

Example:
>>> wv = ['want', 'need', 'feel', 'desire']
>>> corpus.concordance({L: wv, F: 'root'})
   0   01  1-01.txt.xml                But , so I  feel     like i do that for w
   1   01  1-01.txt.xml                         I  felt     a little like oh , i
   2   01  1-01.txt.xml   he 's a difficult man I  feel     like his work ethic
   3   01  1-01.txt.xml                      So I  felt     like i recognized li
   ...                                                                       ...

Arguments are the same as interrogate(), plus:

Parameters: only_format_match (bool) – If True, the left and right windows will contain just words, regardless of what is in show

Parameters: only_unique (bool) – Return only unique lines
Returns: A corpkit.interrogation.Concordance instance
interroplot(search, **kwargs)[source]

Interrogate, relativise, then plot, with very little customisability. A demo function.

Example:
>>> corpus.interroplot(r'/NN.?/ >># NP')
<matplotlib figure>
Parameters:
  • search (dict) – Search criteria, as per interrogate()
  • kwargs (keyword arguments) – Extra arguments to pass to visualise()

Returns: None (but shows a plot)

save(savename=False, **kwargs)[source]

Save corpus class to file

>>> corpus.save(filename)
Parameters: savename (str) – Name for the saved file
Returns: None

Corpora

class corpkit.corpus.Corpora(data=False, **kwargs)[source]

Bases: corpkit.corpus.Datalist

Models a collection of Corpus objects. Methods are available for interrogating and plotting the entire collection. This is the highest level of abstraction available.

Parameters: data (str (a path containing corpora) or list (of corpus paths or corpkit.corpus.Corpus objects)) – The corpora to model

features

Generate and show basic stats from the corpus, including number of sentences, clauses, process types, etc.

Example:
>>> corpus.features
    ..  Characters  Tokens  Words  Closed class words  Open class words  Clauses
    01       26873    8513   7308                4809              3704     2212   
    02       25844    7933   6920                4313              3620     2270   
    03       18376    5683   4877                3067              2616     1640   
    04       20066    6354   5366                3587              2767     1775
postags

Lazy-loaded data.

wordclasses

Lazy-loaded data.

Subcorpus

class corpkit.corpus.Subcorpus(path, datatype)[source]

Bases: corpkit.corpus.Corpus

Models a subcorpus, containing files but no subdirectories.

Methods for interrogating, concordancing and configurations are the same as corpkit.corpus.Corpus.

File

class corpkit.corpus.File(path, dirname, datatype)[source]

Bases: corpkit.corpus.Corpus

Models a corpus file for reading, interrogating and concordancing.

document

Return the parsed XML of a parsed file

read(**kwargs)[source]

Read file data. If the data is pickled, it is unpickled first.

Returns: str or unpickled data

Datalist

class corpkit.corpus.Datalist(data)[source]

Bases: object

A list-like object containing subcorpora or corpus files.

Objects can be accessed as attributes, dict keys or by indexing/slicing.

Methods for interrogating, concordancing and getting configurations are the same as for corpkit.corpus.Corpus.
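The three access styles (attribute, dict key, index/slice) can be sketched with a minimal class. This is a toy illustration, not corpkit's Datalist; ToyDatalist and its .name convention are invented for the example:

```python
class ToyDatalist(object):
    """Minimal list-like container whose items can be reached as
    attributes, as dict keys, or by indexing/slicing.
    Illustrative only -- not corpkit's Datalist implementation.
    """
    def __init__(self, items):
        # items: objects that each carry a .name attribute
        self._items = list(items)

    def __getitem__(self, key):
        if isinstance(key, (int, slice)):
            return self._items[key]
        # treat strings as lookups by item name (dict-key style)
        for item in self._items:
            if item.name == key:
                return item
        raise KeyError(key)

    def __getattr__(self, name):
        # attribute-style access falls back to a name lookup
        try:
            return self[name]
        except KeyError:
            raise AttributeError(name)

    def __len__(self):
        return len(self._items)
```

With this sketch, dl['chapter1'], dl.chapter1 and dl[0] could all return the same item, mirroring the access patterns described above.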

interrogate(*args, **kwargs)[source]

Interrogate the corpus using interrogate()

concordance(*args, **kwargs)[source]

Concordance the corpus using concordance()

configurations(search, **kwargs)[source]

Get a configuration using configurations()