Interrogation classes¶

Once you have searched a Corpus object, you’ll want to be able to edit, visualise and store results. Remember that upon importing corpkit, any pandas.DataFrame or pandas.Series object is monkey-patched with save, edit and visualise methods.

Interrogation¶

class corpkit.interrogation.Interrogation(results=None, totals=None, query=None, concordance=None)[source]¶

Bases: object

Stores results of a corpus interrogation, before or after editing. The main attribute, results, is a Pandas object, which can be edited or plotted.

results = None¶: pandas DataFrame containing counts for each subcorpus

totals = None¶: pandas Series containing summed results

query = None¶: dict containing values that generated the result

concordance = None¶: pandas DataFrame containing concordance lines, if concordance lines were requested.

edit(*args, **kwargs)[source]¶

Manipulate results of interrogations.

There are a few overall kinds of edit, most of which can be combined into a single function call. Keep in mind that many are basic wrappers around pandas operations: if you’re comfortable with pandas syntax, it may at times be faster to use it directly.

Basic mathematical operations:

First, you can do basic maths on results, optionally passing in some data to serve as the denominator. Very commonly, you’ll want to get relative frequencies:

Example:

>>> data = corpus.interrogate({W: r'^t'})
>>> rel = data.edit('%', SELF)
>>> rel.results
    ..    to  that   the  then ...   toilet  tolerant  tolerate  ton
    01 18.50 14.65 14.44  6.20 ...     0.00      0.00      0.11 0.00
    02 24.10 14.34 13.73  8.80 ...     0.00      0.00      0.00 0.00
    03 17.31 18.01  9.97  7.62 ...     0.00      0.00      0.00 0.00

For the operation, there are a number of possible values, each of which is to be passed in as a str:

+, -, /, *, %: self-explanatory

k: calculate keywords

a: get distance metric

SELF is a very useful shorthand denominator. When used, all editing is performed on the data. The totals are then extracted from the edited data, and used as denominator. If this is not the desired behaviour, however, a more specific interrogation.results or interrogation.totals attribute can be used.

In the example above, SELF (or ‘self’) is equivalent to:

Example:

>>> rel = data.edit('%', data.totals)
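
As a rough illustration of the arithmetic behind the '%' operation with a totals denominator, the plain-Python sketch below divides each count by its subcorpus total. The counts and the relative_frequencies helper are hypothetical and not part of corpkit, which performs this step on pandas objects.

```python
# Hypothetical counts: subcorpus -> {entry: absolute frequency}
counts = {
    "01": {"to": 37, "that": 29, "the": 34},
    "02": {"to": 48, "that": 29, "the": 23},
}


def relative_frequencies(counts):
    """Express each count as a percentage of its subcorpus total."""
    rel = {}
    for subcorpus, row in counts.items():
        total = sum(row.values())
        rel[subcorpus] = {
            entry: (100.0 * n / total if total else 0.0)
            for entry, n in row.items()
        }
    return rel


rel = relative_frequencies(counts)
print(rel["01"]["to"])
```

Each row of the result sums to 100, which is the behaviour assumed here for a per-subcorpus denominator.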

Keeping and skipping data:

There are four keyword arguments that can be used to keep or skip rows or columns in the data:

just_entries
just_subcorpora
skip_entries
skip_subcorpora

Each can accept different input types:

str: treated as regular expression to match
list:
- of integers: indices to match
- of strings: entries/subcorpora to match

Example:

>>> data.edit(just_entries=r'^fr', 
...           skip_entries=['free','freedom'],
...           skip_subcorpora=r'[0-9]')
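
The way the different input types might be interpreted can be sketched in plain Python. This is only an illustration under assumptions: the select helper is hypothetical, and using re.search (rather than re.match) for string criteria is a guess, since the matching function is not stated above.

```python
import re


def select(names, just=None, skip=None):
    """Keep names that satisfy `just` and do not satisfy `skip`."""

    def hit(index, name, criteria):
        # str: treated as a regular expression (assumption: re.search)
        if isinstance(criteria, str):
            return re.search(criteria, name) is not None
        # list: integers match by position, strings match by exact name
        return any(
            (c == index) if isinstance(c, int) else (c == name)
            for c in criteria
        )

    kept = []
    for index, name in enumerate(names):
        if just is not None and not hit(index, name, just):
            continue
        if skip is not None and hit(index, name, skip):
            continue
        kept.append(name)
    return kept


entries = ["free", "freedom", "fresh", "friend", "to"]
kept = select(entries, just=r"^fr", skip=["free", "freedom"])
print(kept)
```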

Merging data:

There are also keyword arguments for merging entries and subcorpora:

merge_entries
merge_subcorpora

These take a dict, with the new name as key and the criteria as value. The criteria can be a str (regex) or wordlist.

Example:

>>> from dictionaries.wordlists import wordlists
>>> mer = {'Articles': ['the', 'an', 'a'], 'Modals': wordlists.modals}
>>> data.edit(merge_entries=mer)
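
Conceptually, merging sums the counts of every matching entry under the new name. The sketch below shows that idea on a single row of counts; merge_entries here is a hypothetical stand-alone function, not the corpkit keyword argument, and the regex used for modals is only a stand-in for a real wordlist.

```python
import re


def merge_entries(row, merges):
    """Sum matching entries of one row under a new name."""
    out = dict(row)
    for new_name, criteria in merges.items():
        if isinstance(criteria, str):
            # str criteria: regular expression over entry names
            hits = [k for k in out if re.search(criteria, k)]
        else:
            # list criteria: exact entry names
            hits = [k for k in out if k in criteria]
        if hits:
            out[new_name] = sum(out.pop(k) for k in hits)
    return out


row = {"the": 10, "a": 4, "an": 1, "can": 3, "dog": 2}
merged = merge_entries(
    row,
    {"Articles": ["the", "an", "a"], "Modals": r"^(can|may|must)$"},
)
print(merged)
```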

Sorting:

The sort_by keyword argument takes a str, which represents the way the result columns should be ordered.

increase: highest to lowest slope value
decrease: lowest to highest slope value
turbulent: most change in y axis values
static: least change in y axis values
total/most: largest number first
infreq/least: smallest number first
name: alphabetically

Example:

>>> data.edit(sort_by='increase')
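
To make the slope-based orderings concrete, here is a minimal sketch that fits an ordinary least-squares line to each column and sorts by its slope. The exact regression routine corpkit uses is not described above, so the slope and sort_columns helpers and the sample data are assumptions for illustration only.

```python
def slope(values):
    """Least-squares slope of values against positions 0, 1, 2, ..."""
    n = len(values)
    mean_x = (n - 1) / 2
    mean_y = sum(values) / n
    num = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(values))
    den = sum((x - mean_x) ** 2 for x in range(n))
    return num / den  # requires at least two subcorpora


def sort_columns(columns, sort_by="increase"):
    """Order column names by slope: 'increase' or 'decrease'."""
    reverse = sort_by == "increase"  # highest slope first
    return sorted(columns, key=lambda c: slope(columns[c]), reverse=reverse)


# Hypothetical counts per subcorpus (e.g. 1987, 1988, 1989)
columns = {
    "flight": [7, 7, 5],
    "health": [87, 72, 137],
    "cancer": [25, 24, 61],
}
print(sort_columns(columns, "increase"))
```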

Editing text:

Column labels, corresponding to individual interrogation results, can also be edited with replace_names.

Parameters:	replace_names (str/list of tuples/dict) – Edit result names, then merge duplicate entries

If replace_names is a string, it is treated as a regex to delete from each name. If replace_names is a dict, the value is the regex, and the key is the replacement text. Using a list of tuples in the form (find, replacement) allows duplicate substitution values.

Example:

>>> data.edit(replace_names={r'object': r'[di]obj'})
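
The rename-then-merge behaviour described for the dict form can be sketched as follows. The replace_names function below is a hypothetical stand-alone helper working on one row of counts; it follows the convention stated above, where the key is the replacement text and the value is the regex.

```python
import re


def replace_names(row, replacements):
    """Rename entries by regex, then sum entries that now share a name."""
    out = {}
    for name, count in row.items():
        for replacement, pattern in replacements.items():
            name = re.sub(pattern, replacement, name)
        # Duplicate names produced by renaming are merged by addition
        out[name] = out.get(name, 0) + count
    return out


row = {"dobj": 5, "iobj": 2, "nsubj": 9}
renamed = replace_names(row, {"object": r"[di]obj"})
print(renamed)
```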

Parameters:	replace_subcorpus_names (str/list of tuples/dict) – Edit subcorpus names, then merge duplicates. The same as replace_names, but on the other axis.

Other options:

There are many other miscellaneous options.

Parameters:

keep_stats (bool) – Keep/drop stats values from dataframe after sorting
keep_top (int) – After sorting, remove all but the top keep_top results
just_totals (bool) – Sum each column and work with sums
threshold (int/bool) – When using results list as dataframe 2, drop values occurring fewer than n times. If not keywording, you can use:

‘high’: denominator total / 2500
‘medium’: denominator total / 5000
‘low’: denominator total / 10000

If keywording, there are smaller default thresholds
span_subcorpora (tuple – (int, int2)) – If subcorpora are numerically named, span all from int to int2, inclusive
projection (tuple – (subcorpus_name, n)) – multiply results in subcorpus by n
remove_above_p (bool) – Delete any result over p
p (float) – set the p value
revert_year (bool) – When doing linear regression on years, turn annual subcorpora into 1, 2 ...
print_info (bool) – Print stuff to console showing what’s being edited
spelling (str – ‘US’/‘UK’) – Convert/normalise spelling
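
The named threshold levels reduce to simple arithmetic on the denominator total. The sketch below shows only the non-keywording case described above; resolve_threshold is a hypothetical helper, and how boolean values are handled is not covered here.

```python
# Divisors for the named levels, as listed for the non-keywording case
DIVISORS = {"high": 2500, "medium": 5000, "low": 10000}


def resolve_threshold(threshold, denominator_total):
    """Turn a named level into a minimum count; pass numbers through."""
    if isinstance(threshold, str):
        return denominator_total / DIVISORS[threshold]
    return threshold


def drop_below(totals, minimum):
    """Keep only entries occurring at least `minimum` times."""
    return {k: v for k, v in totals.items() if v >= minimum}


minimum = resolve_threshold("medium", 1_000_000)
print(minimum)
print(drop_below({"the": 5000, "toilet": 12}, minimum))
```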

Keywording options:

If the operation is k, you’re calculating keywords. In this case, some other keyword arguments have an effect:

Parameters:

keyword_measure (str) – What measure to use to calculate keywords:

‘ll’: log-likelihood
‘pd’: percentage difference

Parameters:

selfdrop (bool) – When keywording, try to remove target corpus from reference corpus
calc_all (bool) – When keywording, calculate words that appear in either corpus

Returns:	`corpkit.interrogation.Interrogation`
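
For readers unfamiliar with the two keyness measures, the sketch below implements their commonly cited textbook forms: the two-corpus log-likelihood statistic and the percentage difference of normalised frequencies. It is not taken from corpkit's source, and corpkit's own calculation may differ in detail; the function names and arguments are assumptions.

```python
import math


def log_likelihood(target_count, target_total, ref_count, ref_total):
    """Two-corpus log-likelihood for one word."""
    combined = target_count + ref_count
    grand_total = target_total + ref_total
    expected_target = target_total * combined / grand_total
    expected_ref = ref_total * combined / grand_total
    ll = 0.0
    for observed, expected in (
        (target_count, expected_target),
        (ref_count, expected_ref),
    ):
        if observed > 0:  # a zero count contributes nothing
            ll += observed * math.log(observed / expected)
    return 2 * ll


def percentage_difference(target_count, target_total, ref_count, ref_total):
    """Percentage difference of normalised frequencies (ref_count > 0)."""
    nf_target = target_count / target_total
    nf_ref = ref_count / ref_total
    return (nf_target - nf_ref) * 100 / nf_ref


print(log_likelihood(30, 1000, 10, 1000))
print(percentage_difference(30, 1000, 10, 1000))
```

A word with the same relative frequency in both corpora scores zero on both measures.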

sort(way, **kwargs)[source]¶

visualise(title='', x_label=None, y_label=None, style='ggplot', figsize=(8, 4), save=False, legend_pos='best', reverse_legend='guess', num_to_plot=7, tex='try', colours='Accent', cumulative=False, pie_legend=True, rot=False, partial_pie=False, show_totals=False, transparent=False, output_format='png', interactive=False, black_and_white=False, show_p_val=False, indices=False, transpose=False, **kwargs)[source]¶

Visualise corpus interrogations using matplotlib.

Example:

>>> data.visualise('An example plot', kind='bar', save=True)
<matplotlib figure>

Parameters:

title (str) – A title for the plot
x_label (str) – A label for the x axis
y_label (str) – A label for the y axis
kind (str (‘line’/‘bar’/‘barh’/‘pie’/‘area’/‘heatmap’)) – The kind of chart to make
style (str (‘ggplot’/’bmh’/’fivethirtyeight’/’seaborn-talk’/etc)) – Visual theme of plot
figsize (tuple – (int, int)) – Size of plot
save (bool/str) – If bool, save with title as name; if str, use str as name
legend_pos (str (‘upper right’/’outside right’/etc)) – Where to place legend
reverse_legend (bool) – Reverse the order of the legend
num_to_plot (int/’all’) – How many columns to plot
tex (bool) – Use TeX to draw plot text
colours (str) – Colourmap for lines/bars/slices
cumulative (bool) – Plot values cumulatively
pie_legend (bool) – Show a legend for pie chart
partial_pie (bool) – Allow plotting of pie slices only
show_totals (str – ‘legend’/’plot’/’both’) – Print sums in plot where possible
transparent (bool) – Transparent .png background
output_format (str – ‘png’/’pdf’) – File format for saved image
black_and_white (bool) – Create black and white line styles
show_p_val (bool) – Attempt to print p values in legend if contained in df
indices (bool) – To use when plotting “distance from root”
stacked (str) – When making bar chart, stack bars on top of one another
filled (str) – For area and bar charts, make every column sum to 100
legend (bool) – Show a legend
rot (int) – Rotate x axis ticks by rot degrees
subplots (bool) – Plot each column separately
layout (tuple – (int, int)) – Grid shape to use when subplots is True
interactive (list – [1, 2, 3]) – Experimental interactive options

Returns:

matplotlib figure

multiplot(leftdict={}, rightdict={}, **kwargs)[source]¶

language_model(name, *args, **kwargs)[source]¶: Make a language model from an Interrogation. This is usually done directly on a corpkit.corpus.Corpus object with the make_language_model() method.

save(savename, savedir='saved_interrogations', **kwargs)[source]¶

Save an interrogation as pickle to savedir.

Example:

>>> o = corpus.interrogate(W, 'any')
### create ./saved_interrogations/savename.p
>>> o.save('savename')

Parameters:

savename (str) – A name for the saved file
savedir (str) – Relative path to directory in which to save file
print_info (bool) – Show/hide stdout

Returns:	None
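
Saving as a pickle amounts to serialising the object to savedir/savename.p. The sketch below shows that round trip with the standard library only; save_pickle is a hypothetical helper operating on a plain dict, and it does not reproduce whatever extra handling corpkit's save() performs.

```python
import os
import pickle
import tempfile


def save_pickle(obj, savename, savedir="saved_interrogations"):
    """Write obj to <savedir>/<savename>.p and return the path."""
    os.makedirs(savedir, exist_ok=True)
    path = os.path.join(savedir, savename + ".p")
    with open(path, "wb") as handle:
        pickle.dump(obj, handle)
    return path


# Stand-in for an interrogation result
original = {"to": 2227, "that": 2026}

# Use a temporary directory so the example leaves nothing behind
with tempfile.TemporaryDirectory() as tmp:
    target_dir = os.path.join(tmp, "saved_interrogations")
    path = save_pickle(original, "savename", savedir=target_dir)
    with open(path, "rb") as handle:
        restored = pickle.load(handle)

print(os.path.basename(path), restored == original)
```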

quickview(n=25)[source]¶

View the top n results as painlessly as possible.

Example:

>>> data.quickview(n=5)
to    (n=2227)
that  (n=2026)
the   (n=1302)
then  (n=857)
think (n=676)

Parameters:	n (int) – Show top n results
Returns:	None
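
The output above is essentially the n largest totals, one per line. A minimal sketch of that idea, using a hypothetical quickview_lines helper and made-up totals:

```python
def quickview_lines(totals, n=25):
    """Format the n most frequent entries as 'word (n=count)' lines."""
    top = sorted(totals.items(), key=lambda kv: kv[1], reverse=True)[:n]
    width = max((len(word) for word, _ in top), default=0)
    return [f"{word.ljust(width)} (n={count})" for word, count in top]


totals = {
    "then": 857, "to": 2227, "ton": 3,
    "the": 1302, "think": 676, "that": 2026,
}
lines = quickview_lines(totals, n=5)
print("\n".join(lines))
```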

tabview(**kwargs)[source]¶

asciiplot(row_or_col_name, axis=0, colours=True, num_to_plot=100, line_length=120, min_graph_length=50, separator_length=4, multivalue=False, human_readable='si', graphsymbol='*', float_format='{:, .2f}', **kwargs)[source]¶: A very quick ascii chart for result

rel(denominator='self', **kwargs)[source]¶

keyness(measure='ll', denominator='self', **kwargs)[source]¶

multiindex(indexnames=None)[source]¶

Create a pandas.MultiIndex object from slash-separated results.

Example:

>>> data = corpus.interrogate({W: 'st$'}, show=[L, F])
>>> data.results
    ..  just/advmod  almost/advmod  last/amod 
    01           79             12          6 
    02          105              6          7 
    03           86             10          1 
>>> data.multiindex().results
    Lemma       just almost last first   most 
    Function  advmod advmod amod  amod advmod 
    0             79     12    6     2      3 
    1            105      6    7     1      3 
    2             86     10    1     3      0                                   

Parameters:	indexnames (list of strings) – provide custom names for the new index, or leave blank to guess.
Returns:	`corpkit.interrogation.Interrogation`, with pandas.MultiIndex as results attribute
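
The core of this transformation is splitting each slash-separated column label into a tuple, one element per index level. The sketch below shows only that step, with a hypothetical split_labels helper; tuples of this shape are the kind of input that pandas.MultiIndex.from_tuples accepts.

```python
def split_labels(columns, sep="/"):
    """Split 'lemma/function' style labels into tuples of levels."""
    return [tuple(label.split(sep)) for label in columns]


columns = ["just/advmod", "almost/advmod", "last/amod"]
tuples = split_labels(columns)

# Transpose into one sequence per level (e.g. Lemma, Function)
lemmas, functions = zip(*tuples)
print(lemmas)
print(functions)
```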

topwords(datatype='n', n=10, df=False, sort=True, precision=2)[source]¶

Show top n results in each corpus alongside absolute or relative frequencies.

Parameters:

datatype (str (n/k/%)) – Show abs/rel frequencies, or keyness
n (int) – Number of results to show
df (bool) – Return a DataFrame
sort (bool) – Sort results, or show as is
precision (int) – Float precision to show

Example:

>>> data.topwords(n=5)
    1987           %   1988           %   1989           %   1990           %
    health     25.70   health     15.25   health     19.64   credit      9.22
    security    6.48   cancer     10.85   security    7.91   health      8.31
    cancer      6.19   heart       6.31   cancer      6.55   downside    5.46
    flight      4.45   breast      4.29   credit      4.08   inflation   3.37
    safety      3.49   security    3.94   safety      3.26   cancer      3.12

Returns:	None

perplexity()[source]¶

Pythonification of the formal definition of perplexity.

Input: a sequence of probabilities (any iterable will do)
Output: the perplexity value

from https://github.com/zeffii/NLP_class_notes
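
For reference, the formal definition can be written as 2 raised to the negative mean log2 probability. The sketch below implements that textbook formula as a stand-alone function; it is not copied from corpkit or from the linked notes, which may differ in detail.

```python
import math


def perplexity(probabilities):
    """2 ** (-(1/N) * sum(log2(p))) over a sequence of probabilities."""
    probabilities = list(probabilities)  # accept any iterable
    n = len(probabilities)
    log_sum = sum(math.log2(p) for p in probabilities)
    return 2 ** (-log_sum / n)


# Four equally likely outcomes give a perplexity of 4
print(perplexity([0.25, 0.25, 0.25, 0.25]))
```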

entropy()[source]¶

entropy(pos.edit(merge_entries=mergetags, sort_by='total').results.T)

shannon()[source]¶

Interrodict¶

class corpkit.interrogation.Interrodict(data)[source]¶

Bases: collections.OrderedDict

A class for interrogations that do not fit in a single-indexed DataFrame.

Individual interrogations can be looked up via dict keys, indexes or attributes:

Example:

>>> out_data['WSJ'].results
>>> out_data.WSJ.results
>>> out_data[3].results
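
One way such triple lookup could work on an OrderedDict subclass is sketched below. LookupDict is a hypothetical class written for illustration and is not corpkit's implementation; it falls back to positional lookup for integers and to key lookup for unknown attributes.

```python
from collections import OrderedDict


class LookupDict(OrderedDict):
    """OrderedDict with lookup by key, by position, or by attribute."""

    def __getitem__(self, key):
        # An int that is not itself a key is treated as a position
        if isinstance(key, int) and not OrderedDict.__contains__(self, key):
            key = list(self.keys())[key]
        return OrderedDict.__getitem__(self, key)

    def __getattr__(self, name):
        # Only called when normal attribute lookup fails
        try:
            return OrderedDict.__getitem__(self, name)
        except KeyError:
            raise AttributeError(name)


out_data = LookupDict([("CHT", "cht results"), ("WAP", "wap results"),
                       ("WSJ", "wsj results")])
print(out_data["WSJ"], out_data.WSJ, out_data[2])
```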

Methods for saving, editing, etc. are similar to corpkit.interrogation.Interrogation. Additional methods are available for collapsing into single (multi-indexed) DataFrames.

This class is now deprecated, in favour of a multiindexed DataFrame.

edit(*args, **kwargs)[source]¶

Edit each value with corpkit.interrogation.Interrogation.edit().

See that method for possible arguments.

Returns:	A `corpkit.interrogation.Interrodict`

multiindex(indexnames=False)[source]¶

Create a pandas.MultiIndex version of results.

Example:

>>> d = corpora.interrogate({F: 'compound', GL: '^risk'}, show=L)
>>> d.keys()
    ['CHT', 'WAP', 'WSJ']
>>> d['CHT'].results
    ....  health  cancer  security  credit  flight  safety  heart
    1987      87      25        28      13       7       6      4
    1988      72      24        20      15       7       4      9
    1989     137      61        23      10       5       5      6
>>> d.multiindex().results
    ...               health  cancer  credit  security  downside  
    Corpus Subcorpus                                             
    CHT    1987           87      25      13        28        20 
           1988           72      24      15        20        12 
           1989          137      61      10        23        10                                                      
    WAP    1987           83      44       8        44        10 
           1988           83      27      13        40         6 
           1989           95      77      18        25        12 
    WSJ    1987           52      27      33         4        21 
           1988           39      11      37         9        22 
           1989           55      47      43         9        24 

Returns:	A `corpkit.interrogation.Interrogation`

save(savename, savedir='saved_interrogations', **kwargs)[source]¶

Save an interrogation as pickle to savedir.

Parameters:

savename (str) – A name for the saved file
savedir (str) – Relative path to directory in which to save file
print_info (bool) – Show/hide stdout

Example:

>>> o = corpus.interrogate(W, 'any')
### create ./saved_interrogations/savename.p
>>> o.save('savename')

Returns:	None

collapse(axis='y')[source]¶

Collapse Interrodict on an axis or along interrogation name.

Parameters:	axis (str: x/y/n) – collapse along x, y or name axis
Example:

>>> d = corpora.interrogate({F: 'compound', GL: r'^risk'}, show=L)

>>> d.keys()
    ['CHT', 'WAP', 'WSJ']

>>> d['CHT'].results
    ....  health  cancer  security  credit  flight  safety  heart
    1987      87      25        28      13       7       6      4
    1988      72      24        20      15       7       4      9
    1989     137      61        23      10       5       5      6

>>> d.collapse().results
    ...  health  cancer  credit  security
    CHT    3174    1156     566       697
    WAP    2799     933     582      1127
    WSJ    1812     680    2009       537

>>> d.collapse(axis='x').results
    ...  1987  1988  1989
    CHT   384   328   464
    WAP   389   355   435
    WSJ   428   410   473

>>> d.collapse(axis='key').results
    ...   health  cancer  credit  security
    1987     282     127      65        93
    1988     277     100      70       107
    1989     379     253      83        91

Returns:	A `corpkit.interrogation.Interrogation`
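
The three collapse directions can be understood as summing a corpus × subcorpus × entry structure over one of its dimensions. The sketch below does this on nested dicts with made-up counts; the collapse function here is a hypothetical stand-alone helper, and the axis names follow the example calls above.

```python
def collapse(data, axis="y"):
    """Sum a {corpus: {subcorpus: {entry: n}}} structure over one axis."""
    if axis not in ("y", "x", "key"):
        raise ValueError("axis must be 'y', 'x' or 'key'")
    out = {}
    for corpus, table in data.items():
        for subcorpus, row in table.items():
            for entry, n in row.items():
                if axis == "y":      # sum over subcorpora
                    r, c = corpus, entry
                elif axis == "x":    # sum over entries
                    r, c = corpus, subcorpus
                else:                # sum over corpora
                    r, c = subcorpus, entry
                out.setdefault(r, {})
                out[r][c] = out[r].get(c, 0) + n
    return out


# Hypothetical counts for two corpora and two annual subcorpora
data = {
    "CHT": {"1987": {"health": 87, "cancer": 25},
            "1988": {"health": 72, "cancer": 24}},
    "WAP": {"1987": {"health": 83, "cancer": 44},
            "1988": {"health": 83, "cancer": 27}},
}
print(collapse(data)["CHT"])
print(collapse(data, axis="x")["WAP"])
print(collapse(data, axis="key")["1987"])
```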

topwords(datatype='n', n=10, df=False, sort=True, precision=2)[source]¶

Show top n results in each corpus alongside absolute or relative frequencies.

Parameters:

datatype (str (n/k/%)) – Show abs/rel frequencies, or keyness
n (int) – Number of results to show
df (bool) – Return a DataFrame
sort (bool) – Sort results, or show as is
precision (int) – Float precision to show

Example:

>>> data.topwords(n=5)
    TBT            %   UST            %   WAP            %   WSJ            %
    health     25.70   health     15.25   health     19.64   credit      9.22
    security    6.48   cancer     10.85   security    7.91   health      8.31
    cancer      6.19   heart       6.31   cancer      6.55   downside    5.46
    flight      4.45   breast      4.29   credit      4.08   inflation   3.37
    safety      3.49   security    3.94   safety      3.26   cancer      3.12

Returns:	None

visualise(shape='auto', truncate=8, **kwargs)[source]¶

Attempt to visualise Interrodict by using subplots

Parameters:

shape (tuple) – Layout for the subplots (e.g. (2, 2))
truncate (int) – Only process the first n items in the corpkit.interrogation.Interrodict
kwargs (keyword arguments) – Specifications to pass to plotter()

copy()[source]¶

flip(truncate=30, transpose=True, repeat=False, *args, **kwargs)[source]¶

Change the dimensions of corpkit.interrogation.Interrodict, making column names into keys.

Parameters:

truncate (int/‘all’) – Get first n columns
transpose (bool) – Flip rows and columns
repeat (bool) – Flip twice, to move columns into key position
kwargs – Arguments to pass to the edit() method

Returns:	`corpkit.interrogation.Interrodict`

get_totals()[source]¶: Helper function to concatenate all totals

Concordance¶

class corpkit.interrogation.Concordance(data)[source]¶

Bases: pandas.core.frame.DataFrame

A class for concordance lines, with methods for saving, formatting and editing.

format(kind='string', n=100, window=35, print_it=True, columns='all', metadata=True, **kwargs)[source]¶

Print concordance lines nicely, to string, LaTeX or CSV

Parameters:

kind (str) – Output format: string/latex/csv
n (int/‘all’) – Print first n lines only
window (int) – How many characters to show to left and right
columns (list) – Which columns to show

Example:

>>> lines = corpus.concordance({T: r'/NN.?/ >># NP'}, show=L)
### show 25 characters either side, 4 lines, just text columns
>>> lines.format(window=25, n=4, columns=[L,M,R])
    0                  we 're in  tucson     , then up north to flagst
    1  e 're in tucson , then up  north      to flagstaff , then we we
    2  tucson , then up north to  flagstaff  , then we went through th
    3   through the grand canyon  area       and then phoenix and i sp

Returns:	None
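
The effect of the window argument can be sketched as trimming the left context to its last characters and the right context to its first characters, then aligning the match column. The format_line helper below is hypothetical and only approximates the layout shown in the example output.

```python
def format_line(left, match, right, window=35):
    """Trim contexts to `window` characters and right-align the left one."""
    left_part = left[-window:].rjust(window) if window > 0 else ""
    right_part = right[:window]
    return f"{left_part}  {match}  {right_part}"


line = format_line(
    "we 're in",
    "tucson",
    ", then up north to flagstaff , then we went",
    window=25,
)
print(line)
```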

calculate()[source]¶: Make new Interrogation object from (modified) concordance lines

shuffle(inplace=False)[source]¶

Shuffle concordance lines

Parameters:	inplace (bool) – Modify current object, or create a new one
Example:

>>> lines[:4].shuffle()
01  1-01.txt.conll   through the grand canyon  area       and then phoenix and i sp
01  1-01.txt.conll  e 're in tucson , then up  north      to flagstaff , then we we
01  1-01.txt.conll                  we 're in  tucson     , then up north to flagst
01  1-01.txt.conll  tucson , then up north to  flagstaff  , then we went through th
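
The difference between the two inplace modes can be illustrated on a plain list. In this sketch, shuffle_lines is a hypothetical helper; returning None when inplace=True is an assumption, as the return value for that case is not stated above.

```python
import random


def shuffle_lines(lines, inplace=False, seed=None):
    """Shuffle a list of lines, either in place or as a new list."""
    rng = random.Random(seed)
    if inplace:
        rng.shuffle(lines)  # modifies the caller's list
        return None
    return rng.sample(lines, k=len(lines))  # new list, original untouched


lines = ["tucson", "north", "flagstaff", "area"]
shuffled = shuffle_lines(lines, seed=0)
print(shuffled)
```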

edit(*args, **kwargs)[source]¶

Delete or keep rows by subcorpus or by middle column text.

>>> skipped = conc.edit(skip_entries=r'to_?match')

less(**kwargs)[source]¶