Managing projects
corpkit has a few other bits and pieces designed to make life easier when doing corpus linguistic work. These include methods for loading saved data, for working with multiple corpora at the same time, and for switching between command line and graphical interfaces. Those things are covered here.
Loading saved data
When you’re starting a new session, you probably don’t want to start totally from scratch. It’s handy to be able to load your previous work. You can load data in a few ways.
First, you can call corpkit.load() with the name of the file you’d like to load. By default, corpkit looks in the saved_interrogations directory, but you can pass in an absolute path instead if you like.
>>> import corpkit
>>> nouns = corpkit.load('nouns')
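The lookup rule described above (bare names resolved against a default directory, absolute paths used as given) can be sketched with the standard library alone. This is a minimal illustration, not corpkit's own code: the helper name resolve_saved_path and the DEFAULT_DIR constant are hypothetical, and any file extension corpkit may add is left out.

```python
import os

# Hypothetical constant, named after the directory mentioned in the text.
DEFAULT_DIR = 'saved_interrogations'


def resolve_saved_path(name, default_dir=DEFAULT_DIR):
    """Return the path a saved result would be read from.

    Illustrative only: this is not corpkit's actual lookup code.
    """
    if os.path.isabs(name):
        # Absolute paths are used exactly as given.
        return name
    # Bare names are looked up inside the default directory.
    return os.path.join(default_dir, name)


print(resolve_saved_path('nouns'))
print(resolve_saved_path(os.path.abspath('nouns')))
```

The first call resolves inside the default directory; the second returns the absolute path unchanged.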
Second, you can use corpkit.loader(), which provides a list of items to load, and asks the user for input:
>>> nouns = corpkit.loader()
Third, when instantiating a Corpus object, you can add the load_saved=True keyword argument to load any saved data belonging to this corpus as an attribute.
>>> from corpkit import Corpus
>>> corpus = Corpus('data/psyc-parsed', load_saved=True)
A final alternative approach stores all interrogations within a corpkit.interrogation.Interrodict object:
>>> r = corpkit.load_all_results()
Managing multiple corpora
corpkit can handle one further level of abstraction for both corpora and interrogations. corpkit.corpus.Corpora models a collection of corpkit.corpus.Corpus objects. To create one, pass in a directory containing corpora, or a list of paths or Corpus objects:
>>> from corpkit import Corpora
>>> corpora = Corpora('data')
Individual corpora can be accessed as attributes, by index, or as keys:
>>> corpora.first
>>> corpora[0]
>>> corpora['first']
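To show how a single container can offer all three access styles, here is a minimal sketch. MiniCorpora is a hypothetical stand-in written for this example, not corpkit's Corpora class, and the strings it holds stand in for Corpus objects.

```python
class MiniCorpora:
    """Toy container supporting attribute, index and key access.

    Hypothetical stand-in for illustration; not corpkit's Corpora class.
    """

    def __init__(self, corpora):
        # Map each name to its object; dicts preserve insertion order.
        self._corpora = dict(corpora)

    def __getitem__(self, key):
        if isinstance(key, int):
            # Integer keys index by position.
            return list(self._corpora.values())[key]
        # Any other key is treated as a name.
        return self._corpora[key]

    def __getattr__(self, name):
        # Only called when normal attribute lookup fails.
        if name.startswith('_'):
            raise AttributeError(name)
        try:
            return self._corpora[name]
        except KeyError:
            raise AttributeError(name) from None


corpora = MiniCorpora({'first': 'corpus one', 'second': 'corpus two'})
print(corpora.first, corpora[0], corpora['first'])
```

All three expressions return the same object, which is the behaviour the examples above rely on.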
You can use the interrogate() method to search them, using the same arguments as you would for the interrogate() method of a single Corpus.
Interrogating these objects often returns a corpkit.interrogation.Interrodict object, which models a collection of DataFrames.
Editing can be performed with edit(). The editor will iterate over each DataFrame in turn, generally returning another Interrodict.
Note
There is no visualise() method for Interrodict objects.
multiindex() can turn an Interrodict into a Pandas MultiIndex:
>>> multiple_res.multiindex()
collapse() will collapse one dimension of the Interrodict. You can collapse the x axis ('x'), the y axis ('y'), or the Interrodict keys ('k'). In the example below, an Interrodict is collapsed along each axis in turn.
>>> d = corpora.interrogate({F: 'compound', GL: r'^risk'}, show=L)
>>> d.keys()
['CHT', 'WAP', 'WSJ']
>>> d['CHT'].results
....  health  cancer  security  credit  flight  safety  heart
1987      87      25        28      13       7       6      4
1988      72      24        20      15       7       4      9
1989     137      61        23      10       5       5      6
>>> d.collapse(axis=Y).results
...   health  cancer  credit  security
CHT     3174    1156     566       697
WAP     2799     933     582      1127
WSJ     1812     680    2009       537
>>> d.collapse(axis=X).results
...   1987  1988  1989
CHT    384   328   464
WAP    389   355   435
WSJ    428   410   473
>>> d.collapse(axis=K).results
...   health  cancer  credit  security
1987     282     127      65        93
1988     277     100      70       107
1989     379     253      83        91
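The three collapse directions can be easier to follow on a tiny example. The sketch below uses plain nested dictionaries in place of DataFrames and invented toy counts, so it illustrates the idea only and is not corpkit's implementation. The function name collapse and the layout {key: {row: {column: count}}} are assumptions made for this example.

```python
from collections import defaultdict


def collapse(tables, axis):
    """Sum away one dimension of {key: {row: {column: count}}}.

    Illustrative sketch only; not corpkit's Interrodict.collapse().
    """
    if axis not in ('x', 'y', 'k'):
        raise ValueError("axis must be 'x', 'y' or 'k'")
    out = defaultdict(lambda: defaultdict(int))
    for key, table in tables.items():
        for row, columns in table.items():
            for column, count in columns.items():
                if axis == 'y':
                    # Rows are summed away: keys by columns.
                    out[key][column] += count
                elif axis == 'x':
                    # Columns are summed away: keys by rows.
                    out[key][row] += count
                else:
                    # Keys are summed away: rows by columns.
                    out[row][column] += count
    return {outer: dict(inner) for outer, inner in out.items()}


# Toy counts invented for this example.
toy = {
    'A': {1987: {'health': 2, 'cancer': 1}, 1988: {'health': 3, 'cancer': 4}},
    'B': {1987: {'health': 5, 'cancer': 6}, 1988: {'health': 7, 'cancer': 8}},
}

print(collapse(toy, 'y'))
print(collapse(toy, 'x'))
print(collapse(toy, 'k'))
```

In each case the result has one dimension fewer than the input, matching the shape of the tables shown above.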
topwords() quickly shows the top results from every interrogation in the Interrodict.
>>> data.topwords(n=5)
Output:
TBT            %   UST            %   WAP            %   WSJ            %
health     25.70   health     15.25   health     19.64   credit      9.22
security    6.48   cancer     10.85   security    7.91   health      8.31
cancer      6.19   heart       6.31   cancer      6.55   downside    5.46
flight      4.45   breast      4.29   credit      4.08   inflation   3.37
safety      3.49   security    3.94   safety      3.26   cancer      3.12
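A summary of this shape can be approximated by totalling each word per key and expressing it as a share of that key's overall count. The sketch below does this with plain dictionaries and invented toy counts. Whether corpkit computes its percentages in exactly this way is an assumption, and the function here is a hypothetical stand-in rather than the library's topwords().

```python
def topwords(tables, n=5):
    """Return the n most frequent words per key, with percentages.

    Illustrative sketch over {key: {row: {word: count}}}; not corpkit's
    own method, and the percentage definition is an assumption.
    """
    summary = {}
    for key, table in tables.items():
        totals = {}
        for columns in table.values():
            for word, count in columns.items():
                totals[word] = totals.get(word, 0) + count
        grand_total = sum(totals.values())
        # Sort by descending count; ties keep their original order.
        ranked = sorted(totals.items(), key=lambda item: item[1], reverse=True)
        summary[key] = [
            (word, round(100 * count / grand_total, 2) if grand_total else 0.0)
            for word, count in ranked[:n]
        ]
    return summary


# Toy counts invented for this example.
toy = {
    'A': {1987: {'health': 6, 'cancer': 2}, 1988: {'health': 3, 'cancer': 1}},
    'B': {1987: {'credit': 1, 'health': 1}, 1988: {'credit': 2, 'health': 0}},
}

for key, rows in topwords(toy, n=2).items():
    print(key, rows)
```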
Using the GUI
corpkit is also designed to work as a GUI. It can be started in bash with:
$ python -m corpkit.gui
The GUI can understand any projects you have defined. If you open it, you can simply select your project via Open Project and resume work in a graphical environment.