Corpus classes¶
Much of corpkit's functionality comes from the ability to work with Corpus and Corpus-like objects, which have methods for parsing, tokenising, interrogating and concordancing.
Corpus¶
-
class
corpkit.corpus.
Corpus
(path, **kwargs)[source]¶ Bases:
object
A class representing a linguistic text corpus, which contains files, optionally within subcorpus folders.
Methods for concordancing, interrogating, getting general stats, getting behaviour of particular word, etc.
- document¶ Return the parsed XML of a parsed file
- read(**kwargs)[source]¶ Read file data. If the data is pickled, unpickle it first.
  Returns: str/unpickled data
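The pickle-or-plaintext behaviour described above can be sketched with the standard library. Note that read_data is a hypothetical helper, not corpkit's actual implementation, and the '.p' extension check is an illustrative assumption about how pickled files are identified:

```python
import pickle

def read_data(path):
    # Sketch of read(): if the file looks pickled (here, by an
    # assumed '.p' extension), unpickle and return the object;
    # otherwise return the raw text of the file.
    if path.endswith('.p'):
        with open(path, 'rb') as fh:
            return pickle.load(fh)
    with open(path, encoding='utf-8') as fh:
        return fh.read()
```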
- subcorpora¶ A list-like object containing a corpus's subcorpora.
  Example:
  >>> corpus.subcorpora
  <corpkit.corpus.Datalist instance: 12 items>
- speakerlist¶ Lazy-loaded data.
- files¶ A list-like object containing the files in a folder.
  Example:
  >>> corpus.subcorpora[0].files
  <corpkit.corpus.Datalist instance: 240 items>
- features¶ Generate and show basic stats from the corpus, including number of sentences, clauses, process types, etc.
  Example:
  >>> corpus.features
      Characters  Tokens  Words  Closed class words  Open class words  Clauses
  01       26873    8513   7308                4809              3704     2212
  02       25844    7933   6920                4313              3620     2270
  03       18376    5683   4877                3067              2616     1640
  04       20066    6354   5366                3587              2767     1775
- wordclasses¶ Lazy-loaded data.
- configurations(search, **kwargs)[source]¶ Get the overall behaviour of tokens or lemmas matching a regular expression. The search below makes DataFrames containing the most common subjects, objects, modifiers (etc.) of 'see':
  Parameters: search (dict) – Similar to search in the interrogate()/concordance() methods. W/L keys match word or lemma; an F key specifies the semantic role ('participant', 'process' or 'modifier'). If F is not specified, each role is searched for.
  Example:
  >>> see = corpus.configurations({L: 'see', F: 'process'}, show=L)
  >>> see.has_subject.results.sum()
  i      452
  it     227
  you    162
  we     111
  he      94
  Returns: corpkit.interrogation.Interrodict
- interrogate(search, *args, **kwargs)[source]¶ Interrogate a corpus of texts for a lexicogrammatical phenomenon.
  This method iterates over the files/folders in a corpus, searching the texts and returning a corpkit.interrogation.Interrogation object containing the results. The main options are search, where you specify the search criteria, and show, where you specify what you want to appear in the output.
  Example:
  >>> corpus = Corpus('data/conversations-parsed')
  ### show lemma form of nouns ending in 'ing'
  >>> q = {W: r'ing$', P: r'^N'}
  >>> data = corpus.interrogate(q, show=L)
  >>> data.results
      something  anything  thing  feeling  everything  nothing  morning
  01         14        11     12        1           6        0        1
  02         10        20      4        4           8        3        0
  03         14         5      5        3           1        0        0
  ..        ...       ...    ...      ...         ...      ...      ...
  Parameters: search (str or dict; dict is used when you have multiple criteria) – What the query should be matching:
  - t: tree
  - w: word
  - l: lemma
  - p: pos
  - f: function
  - g/gw: governor
  - gl: governor's lemma form
  - gp: governor's pos
  - gf: governor's function
  - d: dependent
  - dl: dependent's lemma form
  - dp: dependent's pos
  - df: dependent's function
  - i: index
  - n: ngrams (deprecated: use show)
  - s: general stats
  Keys are what to search as str, and values are the criteria: a Tregex query, a regex, or a list of words to match. Therefore, the two syntaxes below do the same thing:
  Example:
  >>> corpus.interrogate(T, r'/NN.?/')
  >>> corpus.interrogate({T: r'/NN.?/'})
  Parameters:
  - searchmode (str: 'any'/'all') – Return results matching any/all criteria
  - exclude (dict, e.g. {L: 'be'}) – The inverse of search, removing results from the search
  - excludemode (str: 'any'/'all') – Exclude results matching any/all criteria
  - query – A search query for the interrogation. This is only used when search is a string, or when multiprocessing. If search is a dict, the query/queries are stored there as the values instead. When multiprocessing, the following is possible:
  Example:
  >>> q = {'Nouns': r'/NN.?/', 'Verbs': r'/VB.?/'}
  ### return a corpkit.interrogation.Interrodict object:
  >>> corpus.interrogate(T, q)
  ### return a corpkit.interrogation.Interrogation object:
  >>> corpus.interrogate(T, q, show=C)
  Parameters: show – What to output. If multiple strings are passed in as a list, results will be colon-separated, in the supplied order. If you want to show ngrams, you can't have multiple values. Possible values are the same as those for search, plus:
  - a: distance from root
  - n: ngram
  - nl: ngram lemma
  - np: ngram POS
  - npl: ngram wordclass
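For reference, the n-gram extraction behind the n/ngram show values can be sketched in plain Python. Here, ngrams is a hypothetical stand-in, not corpkit's internal function:

```python
def ngrams(tokens, gramsize=2):
    # Slide a window of `gramsize` tokens across the sentence,
    # joining each window into one space-separated string.
    return [' '.join(tokens[i:i + gramsize])
            for i in range(len(tokens) - gramsize + 1)]

ngrams(['i', 'feel', 'like', 'it'])
# → ['i feel', 'feel like', 'like it']
```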
  Parameters:
  - lemmatise (bool) – Force lemmatisation on results. Deprecated: instead, output a lemma form with the show argument
  - lemmatag (False/'n'/'v'/'a'/'r') – Explicitly pass a POS to the lemmatiser (generally when data is unparsed), or when the tag cannot be recovered from the Tregex query
  - spelling (False/'US'/'UK') – Convert all to U.S. or U.K. English
  - dep_type (str: 'basic-dependencies'/'a', 'collapsed-dependencies'/'b', 'collapsed-ccprocessed-dependencies'/'c') – The kind of Stanford CoreNLP dependency parses you want to use
  - save (str) – Save result as pickle to saved_interrogations/<save> on completion
  - gramsize (int) – Size of n-grams (default 2)
  - split_contractions (bool) – Make "don't" et al. into two tokens
  - multiprocess (int, or bool to determine automatically) – How many parallel processes to run
  - files_as_subcorpora (bool) – Treat each file as a subcorpus
  - do_concordancing (bool/'only') – Concordance while interrogating, and store the result as the .concordance attribute
  - maxconc (int) – Maximum number of concordance lines
  Returns: A corpkit.interrogation.Interrogation object, with .query, .results and .totals attributes. If multiprocessing is invoked, the result may be a corpkit.interrogation.Interrodict containing corpus names, queries or speakers as keys.
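The searchmode/excludemode semantics above amount to an any/all test over the criteria. A minimal sketch, assuming each token is represented as a dict of features (match_token is hypothetical, not part of corpkit):

```python
def match_token(token, search, searchmode='all'):
    # token: feature dict, e.g. {'w': 'saw', 'l': 'see', 'p': 'VBD'}
    # search: maps feature keys to a collection of acceptable values
    hits = (token.get(key) in values for key, values in search.items())
    # 'all' requires every criterion to hit; 'any' requires one
    return all(hits) if searchmode == 'all' else any(hits)
```

The exclude/excludemode pair works the same way, except that matching tokens are removed from the results rather than kept.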
- parse(corenlppath=False, operations=False, copula_head=True, speaker_segmentation=False, memory_mb=False, multiprocess=False, split_texts=400, *args, **kwargs)[source]¶ Parse an unparsed corpus, saving the result to disk.
  Parameters:
  - corenlppath (str) – Folder containing the CoreNLP jar files (use if corpkit can't find it automatically)
  - operations (str) – Which kinds of annotations to do
  - speaker_segmentation (bool) – Add speaker name to parser output if your corpus is script-like
  - memory_mb (int) – Amount of memory in MB for the parser
  - copula_head (bool) – Make copula head in dependency parse
  - multiprocess (int) – Split parsing across n cores (for high-performance computers)
  Example:
  >>> parsed = corpus.parse(speaker_segmentation=True)
  >>> parsed
  <corpkit.corpus.Corpus instance: speeches-parsed; 9 subcorpora>
  Returns: The newly created corpkit.corpus.Corpus
- tokenise(*args, **kwargs)[source]¶ Tokenise a plaintext corpus, saving the result to disk.
  Parameters: nltk_data_path (str) – Path to the tokeniser, if not found automatically
  Example:
  >>> tok = corpus.tokenise()
  >>> tok
  <corpkit.corpus.Corpus instance: speeches-tokenised; 9 subcorpora>
  Returns: The newly created corpkit.corpus.Corpus
- concordance(*args, **kwargs)[source]¶ A concordance method for Tregex queries, CoreNLP dependencies, tokenised data or plaintext.
  Example:
  >>> wv = ['want', 'need', 'feel', 'desire']
  >>> corpus.concordance({L: wv, F: 'root'})
  0  01  1-01.txt.xml             But , so I   feel  like i do that for w
  1  01  1-01.txt.xml                      I   felt  a little like oh , i
  2  01  1-01.txt.xml  he 's a difficult man I  feel  like his work ethic
  3  01  1-01.txt.xml                   So I   felt  like i recognized li
  ...
  Arguments are the same as for interrogate(), plus:
  Parameters:
  - only_format_match (bool) – If True, the left and right windows will just be words, regardless of what is in show
  - only_unique (bool) – Only unique lines
  Returns: A corpkit.interrogation.Concordance instance
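The left/match/right layout of the lines above can be sketched for a plain token list. Here, make_conc_line is an illustrative helper, not corpkit's implementation:

```python
def make_conc_line(tokens, i, window=4):
    # Up to `window` tokens of left context, the match itself,
    # and up to `window` tokens of right context.
    left = ' '.join(tokens[max(0, i - window):i])
    match = tokens[i]
    right = ' '.join(tokens[i + 1:i + 1 + window])
    return left, match, right
```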
- interroplot(search, **kwargs)[source]¶ Interrogate, relativise, then plot, with very little customisability. A demo function.
  Example:
  >>> corpus.interroplot(r'/NN.?/ >># NP')
  <matplotlib figure>
  Parameters:
  - search (dict) – Search, as per interrogate()
  - kwargs (keyword arguments) – Extra arguments to pass to visualise()
  Returns: None (but shows a plot)
Corpora¶
class corpkit.corpus.Corpora(data=False, **kwargs)[source]¶
Bases: corpkit.corpus.Datalist
Models a collection of Corpus objects. Methods are available for interrogating and plotting the entire collection. This is the highest level of abstraction available.
Parameters: data (str (a path containing corpora) or list (of corpus paths/corpkit.corpus.Corpus objects)) – Corpora to model
- features¶ Generate and show basic stats from the corpus, including number of sentences, clauses, process types, etc.
  Example:
  >>> corpus.features
      Characters  Tokens  Words  Closed class words  Open class words  Clauses
  01       26873    8513   7308                4809              3704     2212
  02       25844    7933   6920                4313              3620     2270
  03       18376    5683   4877                3067              2616     1640
  04       20066    6354   5366                3587              2767     1775
- wordclasses¶ Lazy-loaded data.
Subcorpus¶
class corpkit.corpus.Subcorpus(path, datatype)[source]¶
Bases: corpkit.corpus.Corpus
Model a subcorpus, containing files but no subdirectories.
Methods for interrogating, concordancing and configurations are the same as for corpkit.corpus.Corpus.
File¶
class corpkit.corpus.File(path, dirname, datatype)[source]¶
Bases: corpkit.corpus.Corpus
Models a corpus file for reading, interrogating and concordancing.
- document¶ Return the parsed XML of a parsed file
Datalist¶
class corpkit.corpus.Datalist(data)[source]¶
Bases: object
A list-like object containing subcorpora or corpus files.
Objects can be accessed as attributes, dict keys or by indexing/slicing.
Methods for interrogating, concordancing and getting configurations are the same as for corpkit.corpus.Corpus.
- interrogate(*args, **kwargs)[source]¶ Interrogate the corpus using interrogate()
- concordance(*args, **kwargs)[source]¶ Concordance the corpus using concordance()
- configurations(search, **kwargs)[source]¶ Get a configuration using configurations()
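The attribute/dict-key/index access described above can be sketched with a minimal stand-in class. This is an illustrative re-implementation of the access patterns, not corpkit's actual code, and it assumes each item carries a .name attribute:

```python
class Datalist(object):
    """Sketch: items reachable as attributes, as dict keys,
    or by indexing/slicing."""

    def __init__(self, data):
        # data: a list of objects, each with a .name attribute
        self.data = list(data)

    def __getitem__(self, key):
        # integer index or slice behaves like a plain list
        if isinstance(key, (int, slice)):
            return self.data[key]
        # dict-style access looks items up by name
        for item in self.data:
            if item.name == key:
                return item
        raise KeyError(key)

    def __getattr__(self, name):
        # attribute-style access falls back to name lookup
        try:
            return self[name]
        except KeyError:
            raise AttributeError(name)

    def __len__(self):
        return len(self.data)
```

Subcorpora and files in a real corpus are reached the same three ways, e.g. `corpus.subcorpora[0]` or `corpus.subcorpora['01']`.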