Corpus classes

Much of corpkit's functionality comes from its ability to work with Corpus and Corpus-like objects, which have methods for parsing, tokenising, interrogating and concordancing.

Corpus

class corpkit.corpus.Corpus(path, **kwargs)[source]

Bases: object

A class representing a linguistic text corpus, which contains files, optionally within subcorpus folders.

Methods for concordancing, interrogating, getting general stats, getting the behaviour of particular words, etc.

Unparsed, tokenised and parsed corpora all use this class, though some methods are available only to certain kinds: only unparsed corpora can be parsed, and only parsed or tokenised corpora can be interrogated.

subcorpora

A list-like object containing a corpus’ subcorpora.

Example:
>>> corpus.subcorpora
<corpkit.corpus.Datalist instance: 12 items>
speakerlist

Lazy-loaded data.

files

A list-like object containing the files in a folder.

Example:
>>> corpus.subcorpora[0].files
<corpkit.corpus.Datalist instance: 240 items>
all_filepaths

Lazy-loaded data.

conll_conform(errors='raise')[source]

This removes the sentence index column from old corpkit data.
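
A minimal sketch, using the default errors='raise':
>>> corpus.conll_conform()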

all_files

Lazy-loaded data.

tfidf(search={'w': 'any'}, show=['w'], **kwargs)[source]

Generate a TF-IDF vector representation of the corpus using the interrogate method. All args and kwargs go to interrogate().

Returns:Tuple of the vectoriser and the matrix
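
A hedged sketch, unpacking the tuple described above:
>>> vectoriser, matrix = corpus.tfidf()
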
features

Generate and show basic stats from the corpus, including number of sentences, clauses, process types, etc.

Example:
>>> corpus.features
    ..  Characters  Tokens  Words  Closed class words  Open class words  Clauses
    01       26873    8513   7308                4809              3704     2212   
    02       25844    7933   6920                4313              3620     2270   
    03       18376    5683   4877                3067              2616     1640   
    04       20066    6354   5366                3587              2767     1775
wordclasses

Lazy-loaded data.

postags

Lazy-loaded data.

lexicon

Lazy-loaded data.

configurations(search, **kwargs)[source]

Get the overall behaviour of tokens or lemmas matching a regular expression. The search below makes DataFrames containing the most common subjects, objects, modifiers (etc.) of ‘see’:

Parameters:search (dict) –

Similar to search in the interrogate() method.

Valid keys are:

  • W/L: match word or lemma
  • F: match a semantic role (‘participant’, ‘process’ or ‘modifier’). If F is not specified, each role will be searched for.
Example:
>>> see = corpus.configurations({L: 'see', F: 'process'}, show=L)
>>> see.has_subject.results.sum()
    i           452
    it          227
    you         162
    we          111
    he           94
Returns:corpkit.interrogation.Interrodict
interrogate(search='w', *args, **kwargs)[source]

Interrogate a corpus of texts for a lexicogrammatical phenomenon.

This method iterates over the files/folders in a corpus, searching the texts, and returning a corpkit.interrogation.Interrogation object containing the results. The main options are search, where you specify search criteria, and show, where you specify what you want to appear in the output.

Example:
>>> corpus = Corpus('data/conversations-parsed')
### show lemma form of nouns ending in 'ing'
>>> q = {W: r'ing$', P: r'^N'}
>>> data = corpus.interrogate(q, show=L)
>>> data.results
    ..  something  anything  thing  feeling  everything  nothing  morning
    01         14        11     12        1           6        0        1
    02         10        20      4        4           8        3        0
    03         14         5      5        3           1        0        0
    ...                                                               ...
Parameters:search (dict) –

What part of the lexicogrammar to search, and what criteria to match. The keys are the thing to be searched, and values are the criteria. To search parse trees, use the T key, and a Tregex query as the value. When searching dependencies, you can use any of:

                        Match  Governor  Dependent  Head
  Word                  W      G         D          H
  Lemma                 L      GL        DL         HL
  Function              F      GF        DF         HF
  POS tag               P      GP        DP         HP
  Word class            X      GX        DX         HX
  Distance from root    A      GA        DA         HA
  Index                 I      GI        DI         HI
  Sentence index        S      SI        SI         SI

Values should be regular expressions or wordlists to match.

Example:
>>> corpus.interrogate({T: r'/NN.?/ < /^t/'})  # nouns beginning with 't', via trees
>>> corpus.interrogate({W: r'^t', P: r'^v'})   # verbs beginning with 't', via dependencies
Parameters:
  • searchmode (str: ‘any’/‘all’) – Return results matching any/all criteria
  • exclude (dict, e.g. {L: ‘be’}) – The inverse of search, removing results from the search
  • excludemode (str: ‘any’/‘all’) – Exclude results matching any/all criteria
  • query (str, dict or list) –

    A search query for the interrogation. This is only used when search is a str, or when multiprocessing. If search is a str, the search criteria can be passed in as query, in order to allow the simpler syntax:

    >>> corpus.interrogate(GL, '(think|want|feel)')
    

    When multiprocessing, the following is possible:

    >>> q = {'Nouns': r'/NN.?/', 'Verbs': r'/VB.?/'}
    ### return a corpkit.interrogation.Interrogation object with multiindex:
    >>> corpus.interrogate(T, q)
    ### return a corpkit.interrogation.Interrogation object without multiindex:
    >>> corpus.interrogate(T, q, show=C)
    
  • show (str/list of strings) –

    What to output. If multiple strings are passed in as a list, results will be colon-separated, in the supplied order. Possible values are the same as those for search, plus options for n-gramming and getting collocates:

    Show  Gloss                 Example
    N     N-gram word           The women were
    NL    N-gram lemma          The woman be
    NF    N-gram function       det nsubj root
    NP    N-gram POS tag        DT NNS VBN
    NX    N-gram word class     determiner noun verb
    B     Collocate word        The_were
    BL    Collocate lemma       The_be
    BF    Collocate function    det_root
    BP    Collocate POS tag     DT_VBN
    BX    Collocate word class  determiner_verb
  • lemmatise (bool) – Force lemmatisation on results. Deprecated: instead, output a lemma form with the show argument
  • lemmatag (‘n’/‘v’/‘a’/‘r’/False) – When using a Tregex/Tgrep query, the tool will attempt to determine the word class of results from the query. Passing in a str here tells the lemmatiser the expected POS of the results. It only has an effect if trees are being searched and lemmata are being shown.
  • save (str) – Save result as pickle to saved_interrogations/<save> on completion
  • gramsize (int) – Size of n-grams (default 1, i.e. unigrams)
  • multiprocess (int/bool) – How many parallel processes to run; passing True determines the number automatically
  • files_as_subcorpora (bool) – Deprecated: use subcorpora='file'. Treat each file as a subcorpus, ignoring actual subcorpora if present
  • conc (bool/‘only’) – Generate a concordance while interrogating, store as .concordance attribute
  • coref (bool) – Also get coreferents for search matches
  • tgrep (bool) – Use TGrep for tree querying. TGrep is less expressive than Tregex, and is slower, but can work without Java. This option may be turned on internally if Java is not found.
  • subcorpora (str/list) – Use a metadata value as subcorpora. Passing a list will create a multiindex. ‘file’ and ‘folder’/‘default’ are also possible values (see the sketch below).
  • just_metadata (dict) – One or more metadata fields and criteria to filter sentences by. Only those matching will be kept. Criteria can be a list of words or a regular expression. Passing {'speaker': 'ENVER'} will search only sentences annotated with speaker=ENVER.
  • skip_metadata (dict) – A field and regex/list to filter sentences by. The inverse of just_metadata.
  • discard (int/float) – When returning many (i.e. millions) of results, memory can be a problem. Setting a discard value will ignore results occurring infrequently in a subcorpus. An int will remove any result occurring n times or fewer. A float will remove this proportion of results (i.e. 0.1 will remove 10 per cent)
Returns:

A corpkit.interrogation.Interrogation object, with .query, .results, .totals attributes. If multiprocessing is invoked, result may be multiindexed.
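
A hedged sketch combining several of the keyword arguments above; the function label, metadata field and save name are illustrative:
>>> data = corpus.interrogate({F: 'nsubj'}, show=[L],
...                           subcorpora='speaker',
...                           save='subjects-by-speaker')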

sample(n, level='f')[source]

Get a sample of the corpus

Parameters:
  • n (int/float) – Amount of data in the sample. If an int, get n files; if a float, get that fraction of the corpus (i.e. 0.1 gets 10 per cent)
  • level (str) – sample subcorpora (s) or files (f)
Returns:

a Corpus object
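
For instance, to take ten files, or half of the subcorpora:
>>> corpus.sample(10)               # ten random files
>>> corpus.sample(0.5, level='s')   # half of the subcorpora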

delete_metadata()[source]

Delete metadata for the corpus. May be needed if the corpus has changed.

metadata

Lazy-loaded data.

parse(corenlppath=False, operations=False, copula_head=True, speaker_segmentation=False, memory_mb=False, multiprocess=False, split_texts=400, outname=False, metadata=False, coref=True, *args, **kwargs)[source]

Parse an unparsed corpus, saving to disk

Parameters:
  • corenlppath (str) – Folder containing corenlp jar files (use if corpkit can’t find it automatically)
  • operations (str) – Which kinds of annotations to do
  • speaker_segmentation (bool) – Add speaker name to parser output if your corpus is script-like
  • memory_mb (int) – Amount of memory in MB for parser
  • copula_head (bool) – Make copula head in dependency parse
  • split_texts (int) – Split texts longer than n lines, to save parser memory
  • multiprocess (int) – Split parsing across n cores (for high-performance computers)
  • folderise (bool) – If corpus is just files, move each into own folder
  • output_format (str) – Save parser output as xml, json, conll
  • outname (str) – Specify a name for the parsed corpus
  • metadata (bool) – Use if you have XML tags at the end of lines containing metadata
Example:
>>> parsed = corpus.parse(speaker_segmentation=True)
>>> parsed
<corpkit.corpus.Corpus instance: speeches-parsed; 9 subcorpora>
Returns:The newly created corpkit.corpus.Corpus
tokenise(postag=True, lemmatise=True, *args, **kwargs)[source]

Tokenise a plaintext corpus, saving to disk

Parameters:nltk_data_path (str) – Path to tokeniser if not found automatically
Example:
>>> tok = corpus.tokenise()
>>> tok
<corpkit.corpus.Corpus instance: speeches-tokenised; 9 subcorpora>
Returns:The newly created corpkit.corpus.Corpus
concordance(*args, **kwargs)[source]

A concordance method for Tregex queries, CoreNLP dependencies, tokenised data or plaintext.

Example:
>>> wv = ['want', 'need', 'feel', 'desire']
>>> corpus.concordance({L: wv, F: 'root'})
   0   01  1-01.txt.conll                But , so I  feel     like i do that for w
   1   01  1-01.txt.conll                         I  felt     a little like oh , i
   2   01  1-01.txt.conll   he 's a difficult man I  feel     like his work ethic
   3   01  1-01.txt.conll                      So I  felt     like i recognized li
   ...                                                                       ...

Arguments are the same as interrogate(), plus a few extra parameters:

Parameters:
  • only_format_match (bool) – If True, left and right window will just be words, regardless of what is in show
  • only_unique (bool) – Return only unique lines
  • maxconc (int) – Maximum number of concordance lines
Returns:

A corpkit.interrogation.Concordance instance, with columns showing filename, subcorpus name, speaker name, left context, match and right context.
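
A brief hedged sketch of these extras; the search word is illustrative:
>>> corpus.concordance({W: 'help'}, maxconc=100, only_unique=True)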

interroplot(search, **kwargs)[source]

Interrogate, relativise, then plot, with very little customisability. A demo function.

Example:
>>> corpus.interroplot(r'/NN.?/ >># NP')
<matplotlib figure>
Parameters:
  • search (dict) – Search as per interrogate()
  • kwargs (keyword arguments) – Extra arguments to pass to visualise()
Returns:

None (but show a plot)

save(savename=False, **kwargs)[source]

Save corpus instance to file. There’s not much reason to do this, really.

>>> corpus.save(filename)
Parameters:savename (str) – Name for the file
Returns:None
make_language_model(name, search={'w': 'any'}, exclude=False, show=['w', '+1mw'], **kwargs)[source]

Make a language model for the corpus

Parameters:
  • name (str) – a name for the model
  • kwargs (keyword arguments) – keyword arguments for the interrogate() method
Returns:

a corpkit.model.MultiModel
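
A minimal hedged sketch; the model name is illustrative:
>>> lm = corpus.make_language_model('conversations-model')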

annotate(conclines, annotation, dry_run=True)[source]

Annotate a corpus

Parameters:
  • conclines – a Concordance or DataFrame containing matches to annotate
  • annotation (str/dict) – a tag or field and value
  • dry_run (bool) – Show the annotations to be made, but don’t do them
Returns:

None
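
A hedged sketch of both annotation forms; the tag and field names are illustrative:
>>> lines = corpus.concordance({L: 'win'})
>>> corpus.annotate(lines, 'interesting', dry_run=True)               # preview a tag
>>> corpus.annotate(lines, {'sentiment': 'positive'}, dry_run=False)  # write a field and value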

unannotate(annotation, dry_run=True)[source]

Delete annotation from a corpus

Parameters:annotation (str/dict) – a tag or field and value
Returns:None
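
For instance, to remove the illustrative tag added in the annotate() example:
>>> corpus.unannotate('interesting', dry_run=False)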

Corpora

class corpkit.corpus.Corpora(data=False, **kwargs)[source]

Bases: corpkit.corpus.Datalist

Models a collection of Corpus objects. Methods are available for interrogating and plotting the entire collection. This is the highest level of abstraction available.

Parameters:data (str/list) – Corpora to model. A str is interpreted as a path containing corpora. A list can be a list of corpus paths or corpkit.corpus.Corpus objects.
parse(**kwargs)[source]

Parse multiple corpora

Parameters:kwargs – Arguments to pass to the parse() method.
Returns:corpkit.corpus.Corpora
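
A hedged sketch; the path is illustrative:
>>> corpora = Corpora('data/collection')
>>> parsed = corpora.parse(multiprocess=2)
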
features

Generate and show basic stats from the corpus, including number of sentences, clauses, process types, etc.

Example:
>>> corpus.features
    ..  Characters  Tokens  Words  Closed class words  Open class words  Clauses
    01       26873    8513   7308                4809              3704     2212   
    02       25844    7933   6920                4313              3620     2270   
    03       18376    5683   4877                3067              2616     1640   
    04       20066    6354   5366                3587              2767     1775
postags

Lazy-loaded data.

wordclasses

Lazy-loaded data.

lexicon

Lazy-loaded data.

Subcorpus

class corpkit.corpus.Subcorpus(path, datatype, **kwa)[source]

Bases: corpkit.corpus.Corpus

Model a subcorpus, containing files but no subdirectories.

Methods for interrogating, concordancing and configurations are the same as corpkit.corpus.Corpus.
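
For example, interrogating a single subcorpus (the index and query are illustrative):
>>> corpus.subcorpora[0].interrogate({W: r'^t'})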

File

class corpkit.corpus.File(path, dirname=False, datatype=False, **kwa)[source]

Bases: corpkit.corpus.Corpus

Models a corpus file for reading, interrogating, concordancing.

Methods for interrogating, concordancing and configurations are the same as corpkit.corpus.Corpus, plus methods for accessing the file contents directly as a str, or as a Pandas DataFrame.

read(**kwargs)[source]

Read file data. If data is pickled, unpickle first

Returns:str/unpickled data
document

Return a DataFrame representation of a parsed file
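
A hedged sketch of accessing a file's contents both ways; the indices are illustrative:
>>> f = corpus.subcorpora[0].files[0]
>>> text = f.read()    # file contents as a str
>>> df = f.document    # parsed file as a DataFrame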

trees

Lazy-loaded data.

plain

Lazy-loaded data.

Datalist

class corpkit.corpus.Datalist(data, **kwargs)[source]

Bases: list

interrogate(*args, **kwargs)[source]

Interrogate the corpus using interrogate()

concordance(*args, **kwargs)[source]

Concordance the corpus using concordance()

configurations(search, **kwargs)[source]

Get a configuration using configurations()
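
Since a Datalist behaves like a corpus, these methods can be called directly on, for example, the subcorpora attribute (the query is illustrative):
>>> corpus.subcorpora.interrogate({W: r'^t'}, show=[L])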