Corpus classes¶
Much of corpkit's functionality comes from the ability to work with Corpus and Corpus-like objects, which have methods for parsing, tokenising, interrogating and concordancing.
Corpus¶
class corpkit.corpus.Corpus(path, **kwargs)[source]¶
Bases: object
A class representing a linguistic text corpus, which contains files, optionally within subcorpus folders.
Methods are available for concordancing, interrogating, getting general stats, getting the behaviour of a particular word, etc.
Unparsed, tokenised and parsed corpora use the same class, though some methods are available only to one or another. Only unparsed corpora can be parsed, and only parsed/tokenised corpora can be interrogated.
subcorpora¶
A list-like object containing a corpus' subcorpora.
Example:
>>> corpus.subcorpora
<corpkit.corpus.Datalist instance: 12 items>
speakerlist¶
Lazy-loaded data.
files¶
A list-like object containing the files in a folder.
Example:
>>> corpus.subcorpora[0].files
<corpkit.corpus.Datalist instance: 240 items>
all_filepaths¶
Lazy-loaded data.
all_files¶
Lazy-loaded data.
tfidf(search={'w': 'any'}, show=['w'], **kwargs)[source]¶
Generate a TF-IDF vector representation of the corpus using the interrogate() method. All args and kwargs are passed to interrogate().
Returns: Tuple: the vectoriser and the matrix
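The TF-IDF weighting itself is independent of corpkit. As a sketch of the idea, here is a minimal standard-library implementation run on invented subcorpus frequency counts (the counts and names are illustrative, not corpkit output):

```python
import math

# Toy term counts per subcorpus (invented data, not corpkit output)
counts = {
    '01': {'feel': 4, 'see': 2},
    '02': {'feel': 1, 'want': 3},
    '03': {'want': 2, 'see': 5},
}

def tfidf(counts):
    """Return {subcorpus: {term: tf-idf}} using raw tf and log idf."""
    n_docs = len(counts)
    # document frequency: in how many subcorpora each term occurs
    df = {}
    for doc in counts.values():
        for term in doc:
            df[term] = df.get(term, 0) + 1
    scores = {}
    for name, doc in counts.items():
        total = sum(doc.values())
        scores[name] = {
            term: (freq / total) * math.log(n_docs / df[term])
            for term, freq in doc.items()
        }
    return scores

scores = tfidf(counts)
```

A real pipeline would typically delegate this step to a vectoriser (as the returned tuple above suggests) rather than computing weights by hand.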
features¶
Generate and show basic stats from the corpus, including number of sentences, clauses, process types, etc.
Example:
>>> corpus.features
..  Characters  Tokens  Words  Closed class words  Open class words  Clauses
01       26873    8513   7308                4809              3704     2212
02       25844    7933   6920                4313              3620     2270
03       18376    5683   4877                3067              2616     1640
04       20066    6354   5366                3587              2767     1775
wordclasses¶
Lazy-loaded data.
lexicon¶
Lazy-loaded data.
configurations(search, **kwargs)[source]¶
Get the overall behaviour of tokens or lemmas matching a regular expression. The search below makes DataFrames containing the most common subjects, objects, modifiers (etc.) of 'see':
Parameters: search (dict) – Similar to search in the interrogate() method. Valid keys are:
- W/L: match word or lemma
- F: match a semantic role ('participant', 'process' or 'modifier'). If F is not specified, each role will be searched for.
Example:
>>> see = corpus.configurations({L: 'see', F: 'process'}, show=L)
>>> see.has_subject.results.sum()
i     452
it    227
you   162
we    111
he     94
Returns: corpkit.interrogation.Interrodict
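The kind of aggregation configurations performs can be mimicked with plain dictionaries. Below is a sketch run on invented dependency triples; the real method extracts these relations from parsed output, so the data and helper name here are hypothetical:

```python
from collections import Counter

# Toy dependency triples: (function, token, governor lemma) -- invented data
triples = [
    ('nsubj', 'i', 'see'), ('nsubj', 'it', 'see'), ('nsubj', 'i', 'see'),
    ('nsubj', 'we', 'know'), ('dobj', 'film', 'see'), ('nsubj', 'you', 'see'),
]

def subjects_of(lemma, triples):
    """Count nsubj dependents whose governor has the given lemma."""
    return Counter(tok for func, tok, gov in triples
                   if func == 'nsubj' and gov == lemma)

see_subjects = subjects_of('see', triples)
```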
interrogate(search='w', *args, **kwargs)[source]¶
Interrogate a corpus of texts for a lexicogrammatical phenomenon.
This method iterates over the files/folders in a corpus, searching the texts and returning a corpkit.interrogation.Interrogation object containing the results. The main options are search, where you specify the search criteria, and show, where you specify what you want to appear in the output.
Example:
>>> corpus = Corpus('data/conversations-parsed')
### show lemma form of nouns ending in 'ing'
>>> q = {W: r'ing$', P: r'^N'}
>>> data = corpus.interrogate(q, show=L)
>>> data.results
..  something  anything  thing  feeling  everything  nothing  morning
01         14        11     12        1           6        0        1
02         10        20      4        4           8        3        0
03         14         5      5        3           1        0        0
...       ...
Parameters: search (dict) – What part of the lexicogrammar to search, and what criteria to match. The keys are the thing to be searched, and the values are the criteria. To search parse trees, use the T key, and a Tregex query as the value. When searching dependencies, you can use any of:

                    Match  Governor  Dependent  Head
Word                    W         G          D     H
Lemma                   L        GL         DL    HL
Function                F        GF         DF    HF
POS tag                 P        GP         DP    HP
Word class              X        GX         DX    HX
Distance from root      A        GA         DA    HA
Index                   I        GI         DI    HI
Sentence index          S        SI         SI    SI

Values should be regular expressions or wordlists to match.
Example:
>>> corpus.interrogate({T: r'/NN.?/ < /^t/'})  # T- nouns, via trees
>>> corpus.interrogate({W: '^t', P: r'^v'})    # T- verbs, via dependencies
Parameters:
- searchmode (str – 'any'/'all') – Return results matching any/all criteria
- exclude (dict – {L: 'be'}) – The inverse of search, removing results from the search
- excludemode (str – 'any'/'all') – Exclude results matching any/all criteria
- query (str, dict or list) – A search query for the interrogation. This is only used when search is a str, or when multiprocessing. If search is a str, the search criteria can be passed in as query, in order to allow the simpler syntax:
  >>> corpus.interrogate(GL, '(think|want|feel)')
  When multiprocessing, the following is possible:
  >>> q = {'Nouns': r'/NN.?/', 'Verbs': r'/VB.?/'}
  ### return a corpkit.interrogation.Interrogation object with multiindex:
  >>> corpus.interrogate(T, q)
  ### return a corpkit.interrogation.Interrogation object without multiindex:
  >>> corpus.interrogate(T, q, show=C)
- show (str/list of strings) – What to output. If multiple strings are passed in as a list, results will be colon-separated, in the supplied order. Possible values are the same as those for search, plus options for n-gramming and getting collocates:

  Show  Gloss                 Example
  N     N-gram word           The women were
  NL    N-gram lemma          The woman be
  NF    N-gram function       det nsubj root
  NP    N-gram POS tag        DT NNS VBN
  NX    N-gram word class     determiner noun verb
  B     Collocate word        The_were
  BL    Collocate lemma       The_be
  BF    Collocate function    det_root
  BP    Collocate POS tag     DT_VBN
  BX    Collocate word class  determiner_verb

- lemmatise (bool) – Force lemmatisation on results. Deprecated: instead, output a lemma form with the show argument
- lemmatag ('n'/'v'/'a'/'r'/False) – When using a Tregex/Tgrep query, the tool will attempt to determine the word class of results from the query. Passing in a str here tells the lemmatiser the expected POS of the results. It only has an effect if trees are being searched and lemmata are being shown.
- save (str) – Save the result as a pickle to saved_interrogations/<save> on completion
- gramsize (int) – Size of n-grams (default 1, i.e. unigrams)
- multiprocess (int/bool (bool determines automatically)) – How many parallel processes to run
- files_as_subcorpora (bool) – (Deprecated: use subcorpora=files.) Treat each file as a subcorpus, ignoring actual subcorpora if present
- conc (bool/'only') – Generate a concordance while interrogating, and store it as the .concordance attribute
- coref (bool) – Also get coreferents for search matches
- tgrep (bool) – Use TGrep for tree querying. TGrep is less expressive than Tregex, and is slower, but can work without Java. This option may be turned on internally if Java is not found.
- subcorpora (str/list) – Use a metadata value as subcorpora. Passing a list will create a multiindex. 'file' and 'folder'/'default' are also possible values.
- just_metadata (dict) – One or more metadata fields and criteria to filter sentences by. Only those matching will be kept. Criteria can be a list of words or a regular expression. Passing {'speaker': 'ENVER'} will search only sentences annotated with speaker=ENVER.
- skip_metadata (dict) – A field and regex/list to filter sentences by. The inverse of just_metadata.
- discard (int/float) – When returning many (i.e. millions of) results, memory can be a problem. Setting a discard value will ignore results occurring infrequently in a subcorpus. An int will remove any result occurring n times or fewer. A float will remove this proportion of results (i.e. 0.1 will remove 10 per cent).
Returns: A corpkit.interrogation.Interrogation object, with .query, .results and .totals attributes. If multiprocessing is invoked, the result may be multiindexed.
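The discard semantics described above can be paraphrased in a few lines of plain Python. This is a sketch of the behaviour as documented, not corpkit's actual implementation, and the frequency data is invented:

```python
def discard_rare(results, discard):
    """Drop rare results: an int n drops items occurring n times or fewer;
    a float p drops the least frequent p * 100 per cent of distinct items.
    Sketch of the documented semantics, not corpkit code."""
    if isinstance(discard, int) and not isinstance(discard, bool):
        return {k: v for k, v in results.items() if v > discard}
    # float: drop this proportion of distinct items, least frequent first
    ranked = sorted(results, key=results.get)
    n_drop = int(len(ranked) * discard)
    keep = set(ranked[n_drop:])
    return {k: v for k, v in results.items() if k in keep}

freqs = {'thing': 50, 'something': 30, 'anything': 2, 'morning': 1}
```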
sample(n, level='f')[source]¶
Get a sample of the corpus
Parameters:
- n (int/float) – amount of data in the sample. If an int, get n files. If a float, get float * 100 as a percentage of the corpus
- level (str) – sample subcorpora ('s') or files ('f')
Returns: a Corpus object
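The int/float distinction for n works as in this standard-library sketch (the file names are invented and the helper is hypothetical, not corpkit's implementation):

```python
import random

def sample_files(files, n, seed=None):
    """Sample files: an int n takes n files; a float takes that
    proportion of the corpus (0.25 -> 25 per cent of files)."""
    rng = random.Random(seed)
    if isinstance(n, float):
        n = int(len(files) * n)
    return rng.sample(files, n)

files = ['f%02d.txt' % i for i in range(20)]
picked = sample_files(files, 0.25, seed=1)
```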
metadata¶
Lazy-loaded data.
parse(corenlppath=False, operations=False, copula_head=True, speaker_segmentation=False, memory_mb=False, multiprocess=False, split_texts=400, outname=False, metadata=False, coref=True, *args, **kwargs)[source]¶
Parse an unparsed corpus, saving the result to disk
Parameters:
- corenlppath (str) – Folder containing the CoreNLP jar files (use if corpkit can't find it automatically)
- operations (str) – Which kinds of annotations to do
- speaker_segmentation (bool) – Add speaker name to parser output if your corpus is script-like
- memory_mb (int) – Amount of memory in MB for the parser
- copula_head (bool) – Make the copula the head in the dependency parse
- split_texts – Split texts longer than n lines for parser memory
- multiprocess (int) – Split parsing across n cores (for high-performance computers)
- folderise (bool) – If the corpus is just files, move each into its own folder
- output_format (str) – Save parser output as xml, json or conll
- outname (str) – Specify a name for the parsed corpus
- metadata (bool) – Use if you have XML tags at the end of lines containing metadata
Example:
>>> parsed = corpus.parse(speaker_segmentation=True)
>>> parsed
<corpkit.corpus.Corpus instance: speeches-parsed; 9 subcorpora>
Returns: The newly created corpkit.corpus.Corpus
tokenise(postag=True, lemmatise=True, *args, **kwargs)[source]¶
Tokenise a plaintext corpus, saving the result to disk
Parameters: nltk_data_path (str) – Path to the tokeniser if it is not found automatically
Example:
>>> tok = corpus.tokenise()
>>> tok
<corpkit.corpus.Corpus instance: speeches-tokenised; 9 subcorpora>
Returns: The newly created corpkit.corpus.Corpus
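As a rough illustration of what tokenisation produces, here is a simplified regex stand-in (the nltk_data_path parameter suggests corpkit delegates to NLTK; this sketch is only an approximation of such a tokeniser):

```python
import re

def tokenise(text):
    """Split text into word and punctuation tokens (simplified sketch).
    Keeps contractions like "wasn't" as single tokens."""
    return re.findall(r"\w+(?:'\w+)?|[^\w\s]", text)

tokens = tokenise("She wasn't there, was she?")
```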
concordance(*args, **kwargs)[source]¶
A concordance method for Tregex queries, CoreNLP dependencies, tokenised data or plaintext.
Example:
>>> wv = ['want', 'need', 'feel', 'desire']
>>> corpus.concordance({L: wv, F: 'root'})
0  01  1-01.txt.conll  But , so               I  feel  like i do that for w
1  01  1-01.txt.conll                         I  felt  a little like oh , i
2  01  1-01.txt.conll  he 's a difficult man  I  feel  like his work ethic
3  01  1-01.txt.conll  So                     I  felt  like i recognized li
...
Arguments are the same as for interrogate(), plus a few extra parameters:
Parameters:
- only_format_match (bool) – If True, the left and right windows will just be words, regardless of what is in show
- only_unique (bool) – Return only unique lines
- maxconc (int) – Maximum number of concordance lines
Returns: A corpkit.interrogation.Concordance instance, with columns showing filename, subcorpus name, speaker name, left context, match and right context.
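The underlying keyword-in-context operation can be sketched without corpkit: find matches in a token list and collect a left window, the match, and a right window (toy data; not the library's implementation):

```python
def kwic(tokens, targets, window=4):
    """Return (left, match, right) tuples for each target hit."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok.lower() in targets:
            left = ' '.join(tokens[max(0, i - window):i])
            right = ' '.join(tokens[i + 1:i + 1 + window])
            lines.append((left, tok, right))
    return lines

toks = "But so I feel like I do that for work".split()
lines = kwic(toks, {'feel'}, window=3)
```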
interroplot(search, **kwargs)[source]¶
Interrogate, relativise, then plot, with very little customisability. A demo function.
Example:
>>> corpus.interroplot(r'/NN.?/ >># NP')
<matplotlib figure>
Parameters:
- search (dict) – Search as per interrogate()
- kwargs (keyword arguments) – Extra arguments to pass to visualise()
Returns: None (but shows a plot)
save(savename=False, **kwargs)[source]¶
Save the corpus instance to file. There's not much reason to do this, really.
>>> corpus.save(filename)
Parameters: savename (str) – Name for the file
Returns: None
make_language_model(name, search={'w': 'any'}, exclude=False, show=['w', '+1mw'], **kwargs)[source]¶
Make a language model for the corpus
Parameters:
- name (str) – A name for the model
- kwargs (keyword arguments) – Keyword arguments for the interrogate() method
Returns: a corpkit.model.MultiModel
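As background on what an n-gram language model computes, here is a minimal bigram sketch with maximum-likelihood estimates on toy sentences (no smoothing, unlike a practical model; this is not the MultiModel implementation):

```python
from collections import Counter

def bigram_probs(sentences):
    """MLE bigram probabilities P(w2 | w1) from tokenised sentences."""
    bigrams = Counter()
    unigrams = Counter()
    for sent in sentences:
        padded = ['<s>'] + sent + ['</s>']
        unigrams.update(padded[:-1])          # contexts
        bigrams.update(zip(padded[:-1], padded[1:]))
    return {bg: n / unigrams[bg[0]] for bg, n in bigrams.items()}

probs = bigram_probs([['i', 'feel', 'fine'], ['i', 'see', 'it']])
```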
Corpora¶
class corpkit.corpus.Corpora(data=False, **kwargs)[source]¶
Bases: corpkit.corpus.Datalist
Models a collection of Corpus objects. Methods are available for interrogating and plotting the entire collection. This is the highest level of abstraction available.
Parameters: data (str/list) – Corpora to model. A str is interpreted as a path containing corpora. A list can be a list of corpus paths or corpkit.corpus.Corpus objects.
parse(**kwargs)[source]¶
Parse multiple corpora
Parameters: kwargs – Arguments to pass to the parse() method.
Returns: corpkit.corpus.Corpora
features¶
Generate and show basic stats from the corpus, including number of sentences, clauses, process types, etc.
Example:
>>> corpus.features
..  Characters  Tokens  Words  Closed class words  Open class words  Clauses
01       26873    8513   7308                4809              3704     2212
02       25844    7933   6920                4313              3620     2270
03       18376    5683   4877                3067              2616     1640
04       20066    6354   5366                3587              2767     1775
wordclasses¶
Lazy-loaded data.
lexicon¶
Lazy-loaded data.
Subcorpus¶
class corpkit.corpus.Subcorpus(path, datatype, **kwa)[source]¶
Bases: corpkit.corpus.Corpus
Models a subcorpus, containing files but no subdirectories.
Methods for interrogating, concordancing and configurations are the same as for corpkit.corpus.Corpus.
File¶
class corpkit.corpus.File(path, dirname=False, datatype=False, **kwa)[source]¶
Bases: corpkit.corpus.Corpus
Models a corpus file for reading, interrogating and concordancing.
Methods for interrogating, concordancing and configurations are the same as for corpkit.corpus.Corpus, plus methods for accessing the file contents directly as a str, or as a pandas DataFrame.
read(**kwargs)[source]¶
Read the file data. If the data is pickled, unpickle it first
Returns: str/unpickled data
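Read-with-unpickle behaviour can be sketched with the standard library. The helper below is hypothetical (try-to-unpickle, fall back to text), not corpkit's actual implementation:

```python
import os
import pickle
import tempfile

def read_file(path):
    """Return file contents, unpickling if the data is pickled,
    otherwise decoding it as text. Hypothetical sketch."""
    with open(path, 'rb') as fo:
        raw = fo.read()
    try:
        return pickle.loads(raw)
    except Exception:
        return raw.decode('utf-8')

# demo: one pickled file, one plaintext file
with tempfile.TemporaryDirectory() as tmp:
    p1, p2 = os.path.join(tmp, 'a.p'), os.path.join(tmp, 'b.txt')
    with open(p1, 'wb') as fo:
        pickle.dump({'tokens': 240}, fo)
    with open(p2, 'w') as fo:
        fo.write('plain text')
    pickled, plain = read_file(p1), read_file(p2)
```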
document¶
Return a DataFrame representation of a parsed file
trees¶
Lazy-loaded data.
plain¶
Lazy-loaded data.
Datalist¶
class corpkit.corpus.Datalist(data, **kwargs)[source]¶
Bases: list
interrogate(*args, **kwargs)[source]¶
Interrogate the corpus using interrogate()
concordance(*args, **kwargs)[source]¶
Concordance the corpus using concordance()
configurations(search, **kwargs)[source]¶
Get a configuration using configurations()