Corpus classes

Much of corpkit's functionality comes from its ability to work with Corpus and Corpus-like objects, which have methods for parsing, tokenising, interrogating and concordancing.

Corpus

class corpkit.corpus.Corpus(path, **kwargs)[source]

Bases: object

A class representing a linguistic text corpus, which contains files, optionally within subcorpus folders.

Methods for concordancing, interrogating, getting general stats, getting the behaviour of particular words, etc.

document

Return the parsed XML of a parsed file

read(**kwargs)[source]

Read file data. If the data is pickled, it is unpickled first.

Returns: str or unpickled data
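The pickle fallback described above can be pictured as follows. This is an illustrative sketch only, not corpkit's actual implementation; the helper name read_data is invented for the example:

```python
import pickle

def read_data(path):
    """Return unpickled data if the file is a pickle, else its text.

    Toy stand-in: corpkit's real read() lives on the Corpus/File
    classes and may differ in detail.
    """
    with open(path, 'rb') as f:
        raw = f.read()
    try:
        # Pickled files unpickle cleanly; plain text raises an error
        return pickle.loads(raw)
    except Exception:
        return raw.decode('utf-8')
```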
subcorpora

A list-like object containing a corpus’ subcorpora.

Example:
>>> corpus.subcorpora
<corpkit.corpus.Datalist instance: 12 items>
speakerlist

Lazy-loaded data.

files

A list-like object containing the files in a folder.

Example:
>>> corpus.subcorpora[0].files
<corpkit.corpus.Datalist instance: 240 items>
features

Generate and show basic stats from the corpus, including number of sentences, clauses, process types, etc.

Example:
>>> corpus.features
    ..  Characters  Tokens  Words  Closed class words  Open class words  Clauses
    01       26873    8513   7308                4809              3704     2212   
    02       25844    7933   6920                4313              3620     2270   
    03       18376    5683   4877                3067              2616     1640   
    04       20066    6354   5366                3587              2767     1775
wordclasses

Lazy-loaded data.

postags

Lazy-loaded data.

configurations(search, **kwargs)[source]

Get the overall behaviour of tokens or lemmas matching a regular expression. The search below makes DataFrames containing the most common subjects, objects, modifiers (etc.) of ‘see’:

Parameters: search (dict) – Similar to search in the interrogate() and concordance() methods. W/L keys match word or lemma; the F key specifies semantic role ('participant', 'process' or 'modifier'). If F is not specified, each role will be searched for.

Example:
>>> see = corpus.configurations({L: 'see', F: 'process'}, show = L)
>>> see.has_subject.results.sum()
    i           452
    it          227
    you         162
    we          111
    he           94
Returns: corpkit.interrogation.Interrodict
interrogate(search, *args, **kwargs)[source]

Interrogate a corpus of texts for a lexicogrammatical phenomenon.

This method iterates over the files/folders in a corpus, searching the texts, and returning a corpkit.interrogation.Interrogation object containing the results. The main options are search, where you specify search criteria, and show, where you specify what you want to appear in the output.

Example:
>>> corpus = Corpus('data/conversations-parsed')
### show lemma form of nouns ending in 'ing'
>>> q = {W: r'ing$', P: r'^N'}
>>> data = corpus.interrogate(q, show = L)
>>> data.results
    ..  something  anything  thing  feeling  everything  nothing  morning
    01         14        11     12        1           6        0        1
    02         10        20      4        4           8        3        0
    03         14         5      5        3           1        0        0
    ...                                                               ...
Parameters: search (str or dict; a dict is used when you have multiple criteria) – What the query should match:

  • t: tree
  • w: word
  • l: lemma
  • p: pos
  • f: function
  • g/gw: governor
  • gl: governor’s lemma form
  • gp: governor’s pos
  • gf: governor’s function
  • d: dependent
  • dl: dependent’s lemma form
  • dp: dependent’s pos
  • df: dependent’s function
  • i: index
  • n: ngrams (deprecated; use show)
  • s: general stats

Keys are what to search for, as strings; values are the criteria: a Tregex query, a regex, or a list of words to match. The two syntaxes below therefore do the same thing:

Example:
>>> corpus.interrogate(T, r'/NN.?/')
>>> corpus.interrogate({T: r'/NN.?/'})
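The equivalence between the two call styles can be pictured as a simple normalisation step. This is a hypothetical sketch, not corpkit's code; normalise_search is an invented name, and T is assumed here to be the constant 't':

```python
T = 't'  # assumed constant for Tregex tree searches

def normalise_search(search, query=None):
    """Normalise the two equivalent call styles into one dict form.

    Hypothetical illustration: a single key with a separate query
    string, or a dict mapping keys to queries, end up identical.
    """
    if isinstance(search, dict):
        return search          # already {key: criteria}
    return {search: query}     # single key + separate query string
```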
Parameters:
  • searchmode (str: 'any'/'all') – Return results matching any/all criteria
  • exclude (dict, e.g. {L: 'be'}) – The inverse of search, removing results from the output
  • excludemode (str: 'any'/'all') – Exclude results matching any/all criteria
  • query – A search query for the interrogation. This is only used when search is a string, or when multiprocessing. If search is a dict, the query/queries are stored there as the values instead. When multiprocessing, the following is possible:

Example:
>>> {'Nouns': r'/NN.?/', 'Verbs': r'/VB.?/'}
### return a corpkit.interrogation.Interrodict object:
>>> corpus.interrogate(T, q)
### return a corpkit.interrogation.Interrogation object:
>>> corpus.interrogate(T, q, show = C)
Parameters: show – What to output. If multiple strings are passed in as a list, results will be colon-separated, in the supplied order. If you want to show ngrams, you can’t have multiple values. Possible values are the same as those for search, plus:

  • a: distance from root
  • n: ngram
  • nl: ngram lemma
  • np: ngram POS
  • npl: ngram wordclass
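The colon-separation of multiple show values can be illustrated with a toy token. This sketch invents a token dict and a format_token helper; it does not reflect corpkit internals:

```python
def format_token(token, show):
    """Join the requested attributes of one token with colons.

    'token' is a toy dict of attribute -> value; 'show' is a list of
    attribute names in the order they should appear in the output.
    """
    return ':'.join(token[attr] for attr in show)

token = {'w': 'thing', 'l': 'thing', 'p': 'NN'}
print(format_token(token, ['w', 'p']))  # thing:NN
```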
Parameters: lemmatise (bool) – Force lemmatisation on results. Deprecated: instead, output a lemma form with the show argument.

Parameters: lemmatag (False/'n'/'v'/'a'/'r') – Explicitly pass a POS to the lemmatiser (generally when data is unparsed), or when the tag cannot be recovered from the Tregex query.

Parameters:
  • spelling (False/'US'/'UK') – Convert all results to U.S. or U.K. English
  • dep_type (str) – The kind of Stanford CoreNLP dependency parses you want to use: 'basic-dependencies'/'a', 'collapsed-dependencies'/'b', or 'collapsed-ccprocessed-dependencies'/'c'

Parameters:
  • save (str) – Save the result as a pickle to saved_interrogations/<save> on completion
  • gramsize (int) – Size of n-grams (default 2)
  • split_contractions (bool) – Make “don’t” and similar into two tokens
  • multiprocess (int/bool) – How many parallel processes to run (True to determine automatically)
  • files_as_subcorpora (bool) – Treat each file as a subcorpus
  • do_concordancing (bool/'only') – Concordance while interrogating, storing results as the .concordance attribute
  • maxconc (int) – Maximum number of concordance lines
Returns: A corpkit.interrogation.Interrogation object, with .query, .results and .totals attributes. If multiprocessing is invoked, the result may be a corpkit.interrogation.Interrodict containing corpus names, queries or speakers as keys.
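The shape of the returned object can be pictured with a toy container. This is a hypothetical sketch: the invented names ToyInterrogation and make_result stand in for the real class, whose .results and .totals are pandas objects and differ in detail:

```python
from collections import namedtuple

# Toy stand-in for corpkit.interrogation.Interrogation:
# .query records what was asked, .results holds per-subcorpus counts,
# .totals holds per-subcorpus sums (as used for relativisation).
ToyInterrogation = namedtuple('ToyInterrogation', 'query results totals')

def make_result(query, results):
    """Build the toy result, deriving totals from per-subcorpus counts."""
    totals = {sub: sum(counts.values()) for sub, counts in results.items()}
    return ToyInterrogation(query, results, totals)
```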

parse(corenlppath=False, operations=False, copula_head=True, speaker_segmentation=False, memory_mb=False, multiprocess=False, split_texts=400, *args, **kwargs)[source]

Parse an unparsed corpus, saving to disk

Parameters: corenlppath (str) – Folder containing the CoreNLP jar files (use this if corpkit cannot find them automatically)

Parameters:
  • operations (str) – Which kinds of annotations to perform
  • speaker_segmentation (bool) – Add speaker names to parser output if your corpus is script-like

Parameters:
  • memory_mb (int) – Amount of memory in MB for parser
  • copula_head (bool) – Make copula head in dependency parse
  • multiprocess (int) – Split parsing across n cores (for high-performance computers)
Example:
>>> parsed = corpus.parse(speaker_segmentation = True)
>>> parsed
<corpkit.corpus.Corpus instance: speeches-parsed; 9 subcorpora>
Returns: The newly created corpkit.corpus.Corpus
tokenise(*args, **kwargs)[source]

Tokenise a plaintext corpus, saving to disk

Parameters: nltk_data_path (str) – Path to the tokeniser, if it is not found automatically
Example:
>>> tok = corpus.tokenise()
>>> tok
<corpkit.corpus.Corpus instance: speeches-tokenised; 9 subcorpora>
Returns: The newly created corpkit.corpus.Corpus
concordance(*args, **kwargs)[source]

A concordance method for Tregex queries, CoreNLP dependencies, tokenised data or plaintext.

Example:
>>> wv = ['want', 'need', 'feel', 'desire']
>>> corpus.concordance({L: wv, F: 'root'})
   0   01  1-01.txt.xml                But , so I  feel     like i do that for w
   1   01  1-01.txt.xml                         I  felt     a little like oh , i
   2   01  1-01.txt.xml   he 's a difficult man I  feel     like his work ethic
   3   01  1-01.txt.xml                      So I  felt     like i recognized li
   ...                                                                       ...

Arguments are the same as interrogate(), plus:

Parameters: only_format_match (bool) – If True, the left and right windows will contain just words, regardless of what is in show

Parameters: only_unique (bool) – Return only unique lines
Returns: A corpkit.interrogation.Concordance instance
interroplot(search, **kwargs)[source]

Interrogate, relativise, then plot, with very little customisability. A demo function.

Example:
>>> corpus.interroplot(r'/NN.?/ >># NP')
<matplotlib figure>
Parameters:
  • search (dict) – Search criteria, as per interrogate()
  • kwargs (keyword arguments) – Extra arguments to pass to visualise()

Returns: None (but shows a plot)

save(savename=False, **kwargs)[source]

Save corpus class to file

>>> corpus.save(filename)
Parameters: savename (str) – Name for the saved file
Returns: None

Corpora

class corpkit.corpus.Corpora(data=False, **kwargs)[source]

Bases: corpkit.corpus.Datalist

Models a collection of Corpus objects. Methods are available for interrogating and plotting the entire collection. This is the highest level of abstraction available.

Parameters: data (str (a path containing corpora) or list (of corpus paths or corpkit.corpus.Corpus objects)) – The corpora to model

features

Generate and show basic stats from the corpus, including number of sentences, clauses, process types, etc.

Example:
>>> corpus.features
    ..  Characters  Tokens  Words  Closed class words  Open class words  Clauses
    01       26873    8513   7308                4809              3704     2212   
    02       25844    7933   6920                4313              3620     2270   
    03       18376    5683   4877                3067              2616     1640   
    04       20066    6354   5366                3587              2767     1775
postags

Lazy-loaded data.

wordclasses

Lazy-loaded data.

Subcorpus

class corpkit.corpus.Subcorpus(path, datatype)[source]

Bases: corpkit.corpus.Corpus

Models a subcorpus, containing files but no subdirectories.

Methods for interrogating, concordancing and configurations are the same as corpkit.corpus.Corpus.

File

class corpkit.corpus.File(path, dirname, datatype)[source]

Bases: corpkit.corpus.Corpus

Models a corpus file for reading, interrogating and concordancing.

document

Return the parsed XML of a parsed file

read(**kwargs)[source]

Read file data. If the data is pickled, it is unpickled first.

Returns: str or unpickled data

Datalist

class corpkit.corpus.Datalist(data)[source]

Bases: object

A list-like object containing subcorpora or corpus files.

Objects can be accessed as attributes, dict keys or by indexing/slicing.

Methods for interrogating, concordancing and getting configurations are the same as for corpkit.corpus.Corpus.
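The three access styles (attribute, dict key, index/slice) can be sketched with a minimal class. This is a toy illustration, not corpkit's Datalist; ToyDatalist and its .name convention are invented for the example:

```python
class ToyDatalist(object):
    """Minimal list-like container whose items can be reached as
    attributes, as dict keys, or by indexing/slicing.
    Illustrative only -- not corpkit's Datalist implementation.
    """
    def __init__(self, items):
        # items: objects that each carry a .name attribute
        self._items = list(items)

    def __getitem__(self, key):
        if isinstance(key, (int, slice)):
            return self._items[key]
        # treat strings as lookups by item name (dict-key style)
        for item in self._items:
            if item.name == key:
                return item
        raise KeyError(key)

    def __getattr__(self, name):
        # attribute-style access falls back to a name lookup
        try:
            return self[name]
        except KeyError:
            raise AttributeError(name)

    def __len__(self):
        return len(self._items)
```

With this sketch, dl['chapter1'], dl.chapter1 and dl[0] could all return the same item, mirroring the access patterns described above.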

interrogate(*args, **kwargs)[source]

Interrogate the corpus using interrogate()

concordance(*args, **kwargs)[source]

Concordance the corpus using concordance()

configurations(search, **kwargs)[source]

Get a configuration using configurations()