Interrogation classes

Once you have searched a Corpus object, you’ll want to be able to edit, visualise and store results. Remember that upon importing corpkit, any pandas.DataFrame or pandas.Series object is monkey-patched with save, edit and visualise methods.

Interrogation

class corpkit.interrogation.Interrogation(results=None, totals=None, query=None, concordance=None)[source]

Bases: object

Stores results of a corpus interrogation, before or after editing. The main attribute, results, is a Pandas object, which can be edited or plotted.

results = None

pandas DataFrame containing counts for each subcorpus

totals = None

pandas Series containing summed results

query = None

dict containing values that generated the result

concordance = None

pandas DataFrame containing concordance lines, if concordance lines were requested.

edit(*args, **kwargs)[source]

Manipulate results of interrogations.

There are a few overall kinds of edit, most of which can be combined into a single function call. It’s useful to keep in mind that many are basic wrappers around pandas operations—if you’re comfortable with pandas syntax, it may be faster at times to use its syntax instead.

Basic mathematical operations:
 

First, you can do basic maths on results, optionally passing in some data to serve as the denominator. Very commonly, you’ll want to get relative frequencies:

Example:
>>> data = corpus.interrogate({W: r'^t'})
>>> rel = data.edit('%', SELF)
>>> rel.results
    ..    to  that   the  then ...   toilet  tolerant  tolerate  ton
    01 18.50 14.65 14.44  6.20 ...     0.00      0.00      0.11 0.00
    02 24.10 14.34 13.73  8.80 ...     0.00      0.00      0.00 0.00
    03 17.31 18.01  9.97  7.62 ...     0.00      0.00      0.00 0.00

For the operation, there are a number of possible values, each of which is to be passed in as a str:

+, -, /, *, %: self explanatory

k: calculate keywords

a: get distance metric

SELF is a very useful shorthand denominator. When used, all editing is performed on the data. The totals are then extracted from the edited data, and used as denominator. If this is not the desired behaviour, however, a more specific interrogation.results or interrogation.totals attribute can be used.

In the example above, SELF (or ‘self’) is equivalent to:

Example:
>>> rel = data.edit('%', data.totals)
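
For readers who prefer plain pandas, the relative-frequency calculation can be sketched as below. This is only an illustration of the arithmetic, assuming a small made-up DataFrame with subcorpora as rows and entries as columns; it is not corpkit's own implementation.

```python
import pandas as pd

# Toy counts: rows are subcorpora, columns are entries (illustrative data only)
counts = pd.DataFrame(
    {"to": [30, 10], "that": [50, 30], "the": [20, 60]},
    index=["01", "02"],
)

# Row sums stand in for the `totals` attribute
totals = counts.sum(axis=1)

# Divide every row by its own total, then scale to percentages
rel = counts.div(totals, axis=0) * 100
print(rel.round(2))
```

Each row of the result sums to 100, which is what a percentage of the subcorpus total should do.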
Keeping and skipping data:
 

There are four keyword arguments that can be used to keep or skip rows or columns in the data:

  • just_entries
  • just_subcorpora
  • skip_entries
  • skip_subcorpora

Each can accept different input types:

  • str: treated as regular expression to match
  • list:
    • of integers: indices to match
    • of strings: entries/subcorpora to match
Example:
>>> data.edit(just_entries=r'^fr', 
...           skip_entries=['free','freedom'],
...           skip_subcorpora=r'[0-9]')
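
Because these options are thin wrappers around pandas selection, roughly the same filtering can be written by hand. The sketch below uses invented data and shows one plausible pandas equivalent of each argument; how corpkit resolves the arguments internally may differ.

```python
import pandas as pd

# Illustrative counts: rows are subcorpora, columns are entries
counts = pd.DataFrame(
    {"free": [5, 2], "freedom": [3, 1], "friend": [7, 4], "table": [1, 9]},
    index=["01", "spoken"],
)

# Keep columns whose name matches a regex (compare just_entries=r'^fr')
kept = counts.filter(regex=r"^fr", axis=1)

# Drop columns named in a list (compare skip_entries=['free', 'freedom'])
kept = kept.drop(columns=["free", "freedom"])

# Drop rows whose label matches a regex (compare skip_subcorpora=r'[0-9]')
kept = kept[~kept.index.str.contains(r"[0-9]")]

print(kept)
```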
Merging data:

There are also keyword arguments for merging entries and subcorpora:

  • merge_entries
  • merge_subcorpora

These take a dict, with the new name as key and the criteria as value. The criteria can be a str (regex) or wordlist.

Example:
>>> from dictionaries.wordlists import wordlists
>>> mer = {'Articles': ['the', 'an', 'a'], 'Modals': wordlists.modals}
>>> data.edit(merge_entries=mer)
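
Merging amounts to summing the matching columns into one new column. A minimal pandas sketch of that idea follows, using made-up counts and a hypothetical merge_map variable; it handles list criteria only, on the assumption that a regex would first be resolved to a list of matching names.

```python
import pandas as pd

counts = pd.DataFrame(
    {"the": [10, 12], "a": [4, 6], "an": [1, 2], "dog": [3, 3]},
    index=["01", "02"],
)

# New name -> entries to merge (hypothetical variable, mirrors the dict above)
merge_map = {"Articles": ["the", "an", "a"]}

merged = counts.copy()
for new_name, members in merge_map.items():
    # Ignore members that are not present in the results
    present = [m for m in members if m in merged.columns]
    summed = merged[present].sum(axis=1)
    # Replace the member columns with a single summed column
    merged = merged.drop(columns=present)
    merged[new_name] = summed

print(merged)
```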
Sorting:

The sort_by keyword argument takes a str, which represents the way the result columns should be ordered.

  • increase: highest to lowest slope value
  • decrease: lowest to highest slope value
  • turbulent: most change in y axis values
  • static: least change in y axis values
  • total/most: largest number first
  • infreq/least: smallest number first
  • name: alphabetically
Example:
>>> data.edit(sort_by='increase')
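
The slope-based orderings can be approximated with a least-squares line fitted to each column. The sketch below assumes that "slope" means the gradient of a straight-line fit across subcorpora in order; corpkit's own regression and any p-value handling are not reproduced here.

```python
import numpy as np
import pandas as pd

counts = pd.DataFrame(
    {"rising": [1, 2, 3, 4], "falling": [8, 6, 4, 2], "flat": [5, 5, 5, 5]},
    index=["1987", "1988", "1989", "1990"],
)

# Use 0, 1, 2, ... as x values, one per subcorpus
x = np.arange(len(counts))

# Gradient of a degree-1 least-squares fit for every column
slopes = pd.Series(
    {col: np.polyfit(x, counts[col].to_numpy(dtype=float), 1)[0]
     for col in counts.columns}
)

# 'increase': steepest upward slope first; 'decrease': steepest downward first
increase = counts[slopes.sort_values(ascending=False).index]
decrease = counts[slopes.sort_values(ascending=True).index]

print(list(increase.columns))
print(list(decrease.columns))
```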
Editing text:

Column labels, corresponding to individual interrogation results, can also be edited with replace_names.

Parameters:replace_names (str/list of tuples/dict) – Edit result names, then merge duplicate entries

If replace_names is a string, it is treated as a regex to delete from each name. If replace_names is a dict, the value is the regex, and the key is the replacement text. Using a list of tuples in the form (find, replacement) allows duplicate substitution values.

Example:
>>> data.edit(replace_names={r'object': r'[di]obj'})
Parameters:replace_subcorpus_names (str/list of tuples/dict) – Edit subcorpus names, then merge duplicates. The same as replace_names, but on the other axis.
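
The "rename, then merge duplicates" behaviour can be imitated in pandas as follows. This is a sketch with invented data and a hypothetical rename helper, assuming that duplicate labels are merged by summing; it follows the dict orientation described above (key is the replacement, value is the regex).

```python
import re
import pandas as pd

counts = pd.DataFrame(
    {"dobj": [3, 1], "iobj": [2, 2], "nsubj": [5, 4]},
    index=["01", "02"],
)

# key = replacement text, value = regex to find
replacements = {"object": r"[di]obj"}

def rename(label):
    """Apply every substitution to a single column label (hypothetical helper)."""
    for new, pattern in replacements.items():
        label = re.sub(pattern, new, label)
    return label

renamed = counts.rename(columns=rename)

# Columns that now share a label are summed together
merged = renamed.T.groupby(level=0).sum().T

print(merged)
```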
Other options:

There are many other miscellaneous options.

Parameters:
  • keep_stats (bool) – Keep/drop stats values from dataframe after sorting
  • keep_top (int) – After sorting, remove all but the top keep_top results
  • just_totals (bool) – Sum each column and work with sums
  • threshold (int/bool) –
    When using results list as dataframe 2, drop values
    occurring fewer than n times. If not keywording, you can use:

    ‘high’: denominator total / 2500

    ‘medium’: denominator total / 5000

    ‘low’: denominator total / 10000

    If keywording, there are smaller default thresholds

  • span_subcorpora (tuple(int, int2)) – If subcorpora are numerically named, span all from int to int2, inclusive
  • projection (tuple – (subcorpus_name, n)) – multiply results in subcorpus by n
  • remove_above_p (bool) – Delete any result over p
  • p (float) – set the p value
  • revert_year (bool) – When doing linear regression on years, turn annual subcorpora into 1, 2 ...
  • print_info (bool) – Print stuff to console showing what’s being edited
  • spelling (str: ‘US’/‘UK’) – Convert/normalise spelling
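
The named threshold levels listed under threshold are simple fractions of the denominator total. The arithmetic can be sketched as below; resolve_threshold and the data are hypothetical, only str and int inputs are handled, and the smaller keywording defaults are not covered.

```python
import pandas as pd

# Divisors taken from the threshold description above
DIVISORS = {"high": 2500, "medium": 5000, "low": 10000}

def resolve_threshold(threshold, denominator_total):
    """Turn a named level into a number; pass integers through (hypothetical helper)."""
    if isinstance(threshold, str):
        return denominator_total / DIVISORS[threshold]
    return threshold

# Illustrative per-entry totals and denominator size
totals = pd.Series({"the": 900, "cat": 150, "zyzzyva": 3})
denominator_total = 1_000_000

cutoff = resolve_threshold("medium", denominator_total)

# Drop entries occurring fewer than `cutoff` times
kept = totals[totals >= cutoff]

print(cutoff, list(kept.index))
```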
Keywording options:
 

If the operation is k, you’re calculating keywords. In this case, some other keyword arguments have an effect:

Parameters:keyword_measure (str) – what measure to use to calculate keywords: ll (log-likelihood) or pd (percentage difference)

Parameters:
  • selfdrop (bool) – When keywording, try to remove target corpus from reference corpus
  • calc_all (bool) – When keywording, calculate words that appear in either corpus
Returns:

corpkit.interrogation.Interrogation
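
For orientation, the two keyword measures (ll and pd) are commonly defined as in the sketch below. These are the standard textbook formulas for log-likelihood and percentage difference, written with hypothetical function names; corpkit's own implementation, thresholds and sign conventions may differ.

```python
import math

def log_likelihood(a, b, c, d):
    """Log-likelihood keyness.

    a, b: frequency of the word in the target and reference corpus
    c, d: total size of the target and reference corpus
    """
    # Expected frequencies if the word were evenly distributed
    e1 = c * (a + b) / (c + d)
    e2 = d * (a + b) / (c + d)
    ll = 0.0
    # Terms with a zero observed frequency contribute nothing
    if a > 0:
        ll += a * math.log(a / e1)
    if b > 0:
        ll += b * math.log(b / e2)
    return 2 * ll

def percentage_difference(a, b, c, d):
    """Percentage difference of normalised frequencies (undefined when b == 0)."""
    nf1 = a / c
    nf2 = b / d
    return (nf1 - nf2) * 100 / nf2

print(log_likelihood(20, 10, 1000, 1000))
print(percentage_difference(20, 10, 1000, 1000))
```

A word used equally often in both corpora scores zero on both measures; a word twice as frequent in the target corpus has a percentage difference of 100.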

sort(way, **kwargs)[source]
visualise(title='', x_label=None, y_label=None, style='ggplot', figsize=(8, 4), save=False, legend_pos='best', reverse_legend='guess', num_to_plot=7, tex='try', colours='Accent', cumulative=False, pie_legend=True, rot=False, partial_pie=False, show_totals=False, transparent=False, output_format='png', interactive=False, black_and_white=False, show_p_val=False, indices=False, transpose=False, **kwargs)[source]

Visualise corpus interrogations using matplotlib.

Example:
>>> data.visualise('An example plot', kind='bar', save=True)
<matplotlib figure>
Parameters:
  • title (str) – A title for the plot
  • x_label (str) – A label for the x axis
  • y_label (str) – A label for the y axis
  • kind (str (‘line’/‘bar’/‘barh’/‘pie’/‘area’/‘heatmap’)) – The kind of chart to make
  • style (str (‘ggplot’/‘bmh’/‘fivethirtyeight’/‘seaborn-talk’/etc)) – Visual theme of plot
  • figsize (tuple(int, int)) – Size of plot
  • save (bool/str) – If bool, save with title as name; if str, use str as name
  • legend_pos (str (‘upper right’/‘outside right’/etc)) – Where to place legend
  • reverse_legend (bool) – Reverse the order of the legend
  • num_to_plot (int/‘all’) – How many columns to plot
  • tex (bool) – Use TeX to draw plot text
  • colours (str) – Colourmap for lines/bars/slices
  • cumulative (bool) – Plot values cumulatively
  • pie_legend (bool) – Show a legend for pie chart
  • partial_pie (bool) – Allow plotting of pie slices only
  • show_totals (str – ‘legend’/‘plot’/‘both’) – Print sums in plot where possible
  • transparent (bool) – Transparent .png background
  • output_format (str – ‘png’/‘pdf’) – File format for saved image
  • black_and_white (bool) – Create black and white line styles
  • show_p_val (bool) – Attempt to print p values in legend if contained in df
  • indices (bool) – To use when plotting “distance from root”
  • stacked (str) – When making bar chart, stack bars on top of one another
  • filled (str) – For area and bar charts, make every column sum to 100
  • legend (bool) – Show a legend
  • rot (int) – Rotate x axis ticks by rot degrees
  • subplots (bool) – Plot each column separately
  • layout (tuple(int, int)) – Grid shape to use when subplots is True
  • interactive (list[1, 2, 3]) – Experimental interactive options
Returns:

matplotlib figure

multiplot(leftdict={}, rightdict={}, **kwargs)[source]
language_model(name, *args, **kwargs)[source]

Make a language model from an Interrogation. This is usually done directly on a corpkit.corpus.Corpus object with the make_language_model() method.

save(savename, savedir='saved_interrogations', **kwargs)[source]

Save an interrogation as pickle to savedir.

Example:
>>> o = corpus.interrogate(W, 'any')
### create ./saved_interrogations/savename.p
>>> o.save('savename')
Parameters:
  • savename (str) – A name for the saved file
  • savedir (str) – Relative path to directory in which to save file
  • print_info (bool) – Show/hide stdout
Returns:

None
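
Since the saved file is a pickle, the round trip can be illustrated with pandas alone. The sketch below writes a made-up DataFrame to a temporary directory laid out like the default savedir; it does not call corpkit, and the loading side is an assumption about how such a file could be read back.

```python
import os
import tempfile
import pandas as pd

counts = pd.DataFrame({"to": [3, 4], "that": [1, 2]}, index=["01", "02"])

# Use a throwaway base directory so the example leaves no files behind in the project
base = tempfile.mkdtemp()
savedir = os.path.join(base, "saved_interrogations")
os.makedirs(savedir, exist_ok=True)

# Write the pickle, then read it back
path = os.path.join(savedir, "savename.p")
counts.to_pickle(path)
restored = pd.read_pickle(path)

print(restored.equals(counts))
```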

quickview(n=25)[source]

View top n results as painlessly as possible.

Example:
>>> data.quickview(n=5)
    0: to    (n=2227)
    1: that  (n=2026)
    2: the   (n=1302)
    3: then  (n=857)
    4: think (n=676)
Parameters:n (int) – Show top n results
Returns:None
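
A similar listing can be produced from any results DataFrame with a few lines of pandas. The helper below, quickview_lines, is hypothetical and only approximates the output format shown above.

```python
import pandas as pd

counts = pd.DataFrame(
    {"to": [5, 7], "that": [9, 8], "the": [1, 2]},
    index=["01", "02"],
)

def quickview_lines(df, n=25):
    """Return 'rank: entry (n=total)' strings for the n most frequent entries."""
    totals = df.sum().sort_values(ascending=False).head(n)
    # Pad entry names so the totals line up
    width = max(len(str(name)) for name in totals.index)
    return [
        f"{i}: {str(name).ljust(width)} (n={total})"
        for i, (name, total) in enumerate(totals.items())
    ]

lines = quickview_lines(counts, n=2)
print("\n".join(lines))
```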
tabview(**kwargs)[source]
asciiplot(row_or_col_name, axis=0, colours=True, num_to_plot=100, line_length=120, min_graph_length=50, separator_length=4, multivalue=False, human_readable='si', graphsymbol='*', float_format='{:, .2f}', **kwargs)[source]

A very quick ASCII chart for a result.

rel(denominator='self', **kwargs)[source]
keyness(measure='ll', denominator='self', **kwargs)[source]
multiindex(indexnames=None)[source]

Create a pandas.MultiIndex object from slash-separated results.

Example:
>>> data = corpus.interrogate({W: 'st$'}, show=[L, F])
>>> data.results
    ..  just/advmod  almost/advmod  last/amod 
    01           79             12          6 
    02          105              6          7 
    03           86             10          1 
>>> data.multiindex().results
    Lemma       just almost last first   most 
    Function  advmod advmod amod  amod advmod 
    0             79     12    6     2      3 
    1            105      6    7     1      3 
    2             86     10    1     3      0                                   
Parameters:indexnames (list of strings) – provide custom names for the new index, or leave blank to guess.
Returns:corpkit.interrogation.Interrogation, with pandas.MultiIndex as results attribute
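
The underlying transformation, splitting slash-separated column names into index levels, can be sketched in pandas as follows. The data is invented, and the level names are supplied by hand rather than guessed as indexnames would allow.

```python
import pandas as pd

counts = pd.DataFrame(
    {"just/advmod": [79, 105], "almost/advmod": [12, 6], "last/amod": [6, 7]},
    index=["01", "02"],
)

# Split every 'lemma/function' label into a tuple, one item per level
tuples = [tuple(col.split("/")) for col in counts.columns]

multi = counts.copy()
multi.columns = pd.MultiIndex.from_tuples(tuples, names=["Lemma", "Function"])

# With two levels, columns can be selected by function alone
advmod = multi.xs("advmod", axis=1, level="Function")

print(multi)
print(advmod)
```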

topwords(datatype='n', n=10, df=False, sort=True, precision=2)[source]

Show top n results in each corpus alongside absolute or relative frequencies.

Parameters:
  • datatype (str (n/k/%)) – show abs/rel frequencies, or keyness
  • n (int) – number of results to show
  • df (bool) – return a DataFrame
  • sort (bool) – Sort results, or show as is
  • precision (int) – float precision to show
Example:
>>> data.topwords(n=5)
    1987           %   1988           %   1989           %   1990           %
    health     25.70   health     15.25   health     19.64   credit      9.22
    security    6.48   cancer     10.85   security    7.91   health      8.31
    cancer      6.19   heart       6.31   cancer      6.55   downside    5.46
    flight      4.45   breast      4.29   credit      4.08   inflation   3.37
    safety      3.49   security    3.94   safety      3.26   cancer      3.12
Returns:None
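
The per-subcorpus ranking can be reproduced with pandas when only the numbers are needed. The sketch below uses invented counts and a hypothetical top_entries helper, and covers the relative-frequency case only.

```python
import pandas as pd

counts = pd.DataFrame(
    {"health": [87, 20], "cancer": [25, 30], "credit": [13, 50]},
    index=["1987", "1990"],
)

def top_entries(df, n=10, precision=2):
    """Map each subcorpus to its n most frequent entries, as percentages."""
    rel = df.div(df.sum(axis=1), axis=0) * 100
    return {
        name: row.sort_values(ascending=False).head(n).round(precision)
        for name, row in rel.iterrows()
    }

tops = top_entries(counts, n=2)
for name, series in tops.items():
    print(name, dict(series))
```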
perplexity()[source]

Pythonification of the formal definition of perplexity.

Input: a sequence of chances (any iterable will do).

Output: perplexity value.

from https://github.com/zeffii/NLP_class_notes
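
As a reference point, the usual definition of perplexity over a sequence of probabilities is 2 raised to the negative mean log2 probability. The standalone function below is a sketch of that definition; it is not copied from corpkit or from the linked notes, and it takes the probabilities as an argument, unlike the method above.

```python
import math

def perplexity(probabilities):
    """Perplexity of a sequence of probabilities (each must be > 0)."""
    probs = list(probabilities)
    # Sum of base-2 log probabilities
    log_sum = sum(math.log2(p) for p in probs)
    # 2 to the power of the negative average log probability
    return 2 ** (-log_sum / len(probs))

print(perplexity([0.25, 0.25, 0.25, 0.25]))
```

A model that assigns every item a probability of 0.25 is, in effect, choosing uniformly among four options, so its perplexity is 4.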

entropy()[source]

Example: entropy(pos.edit(merge_entries=mergetags, sort_by='total').results.T)

shannon()[source]

Interrodict

class corpkit.interrogation.Interrodict(data)[source]

Bases: collections.OrderedDict

A class for interrogations that do not fit in a single-indexed DataFrame.

Individual interrogations can be looked up via dict keys, indexes or attributes:

Example:
>>> out_data['WSJ'].results
>>> out_data.WSJ.results
>>> out_data[3].results

Methods for saving, editing, etc. are similar to corpkit.interrogation.Interrogation. Additional methods are available for collapsing into single (multi-indexed) DataFrames.

This class is now deprecated, in favour of a multiindexed DataFrame.

edit(*args, **kwargs)[source]

Edit each value with edit().

See edit() for possible arguments.

Returns:A corpkit.interrogation.Interrodict
multiindex(indexnames=False)[source]

Create a pandas.MultiIndex version of results.

Example:
>>> d = corpora.interrogate({F: 'compound', GL: '^risk'}, show=L)
>>> d.keys()
    ['CHT', 'WAP', 'WSJ']
>>> d['CHT'].results
    ....  health  cancer  security  credit  flight  safety  heart
    1987      87      25        28      13       7       6      4
    1988      72      24        20      15       7       4      9
    1989     137      61        23      10       5       5      6
>>> d.multiindex().results
    ...               health  cancer  credit  security  downside  
    Corpus Subcorpus                                             
    CHT    1987           87      25      13        28        20 
           1988           72      24      15        20        12 
           1989          137      61      10        23        10                                                      
    WAP    1987           83      44       8        44        10 
           1988           83      27      13        40         6 
           1989           95      77      18        25        12 
    WSJ    1987           52      27      33         4        21 
           1988           39      11      37         9        22 
           1989           55      47      43         9        24 
Returns:A corpkit.interrogation.Interrogation
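
Given that the class is deprecated in favour of a multiindexed DataFrame, it may help to see how a plain dict of DataFrames can be stacked the same way. The sketch below uses pandas only, with a small subset of made-up figures; the level names follow the output shown above.

```python
import pandas as pd

# One DataFrame per corpus, as an Interrodict would hold them
frames = {
    "CHT": pd.DataFrame({"health": [87, 72], "cancer": [25, 24]},
                        index=["1987", "1988"]),
    "WAP": pd.DataFrame({"health": [83, 83], "cancer": [44, 27]},
                        index=["1987", "1988"]),
}

# Dict keys become the outer index level
combined = pd.concat(frames, names=["Corpus", "Subcorpus"])

print(combined)
```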
save(savename, savedir='saved_interrogations', **kwargs)[source]

Save an interrogation as pickle to savedir.

Parameters:
  • savename (str) – A name for the saved file
  • savedir (str) – Relative path to directory in which to save file
  • print_info (bool) – Show/hide stdout
Example:
>>> o = corpus.interrogate(W, 'any')
### create ``saved_interrogations/savename.p``
>>> o.save('savename')
Returns:None
collapse(axis='y')[source]

Collapse Interrodict on an axis or along interrogation name.

Parameters:axis (str: x/y/n) – collapse along x, y or name axis
Example:
>>> d = corpora.interrogate({F: 'compound', GL: r'^risk'}, show=L)

>>> d.keys()
    ['CHT', 'WAP', 'WSJ']

>>> d['CHT'].results
    ....  health  cancer  security  credit  flight  safety  heart
    1987      87      25        28      13       7       6      4
    1988      72      24        20      15       7       4      9
    1989     137      61        23      10       5       5      6

>>> d.collapse().results
    ...  health  cancer  credit  security
    CHT    3174    1156     566       697
    WAP    2799     933     582      1127
    WSJ    1812     680    2009       537

>>> d.collapse(axis='x').results
    ...  1987  1988  1989
    CHT   384   328   464
    WAP   389   355   435
    WSJ   428   410   473

>>> d.collapse(axis='key').results
    ...   health  cancer  credit  security
    1987     282     127      65        93
    1988     277     100      70       107
    1989     379     253      83        91
Returns:A corpkit.interrogation.Interrogation
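
The three result shapes in the example correspond to summing over a different dimension each time. A pandas sketch of those three sums is given below, with invented data and descriptive variable names; which axis value produces which shape in corpkit should be checked against the example output rather than inferred from this code.

```python
import pandas as pd

frames = {
    "CHT": pd.DataFrame({"health": [87, 72], "cancer": [25, 24]},
                        index=["1987", "1988"]),
    "WAP": pd.DataFrame({"health": [83, 83], "cancer": [44, 27]},
                        index=["1987", "1988"]),
}
combined = pd.concat(frames, names=["Corpus", "Subcorpus"])

# Sum over subcorpora: one row per corpus, one column per entry
by_corpus = combined.groupby(level="Corpus").sum()

# Sum over entries: one row per corpus, one column per subcorpus
by_subcorpus = combined.sum(axis=1).unstack("Subcorpus")

# Sum over corpora: one row per subcorpus, one column per entry
by_entry = combined.groupby(level="Subcorpus").sum()

print(by_corpus, by_subcorpus, by_entry, sep="\n\n")
```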
topwords(datatype='n', n=10, df=False, sort=True, precision=2)[source]

Show top n results in each corpus alongside absolute or relative frequencies.

Parameters:
  • datatype (str (n/k/%)) – show abs/rel frequencies, or keyness
  • n (int) – number of results to show
  • df (bool) – return a DataFrame
  • sort (bool) – Sort results, or show as is
  • precision (int) – float precision to show
Example:
>>> data.topwords(n=5)
    TBT            %   UST            %   WAP            %   WSJ            %
    health     25.70   health     15.25   health     19.64   credit      9.22
    security    6.48   cancer     10.85   security    7.91   health      8.31
    cancer      6.19   heart       6.31   cancer      6.55   downside    5.46
    flight      4.45   breast      4.29   credit      4.08   inflation   3.37
    safety      3.49   security    3.94   safety      3.26   cancer      3.12
Returns:None
visualise(shape='auto', truncate=8, **kwargs)[source]

Attempt to visualise Interrodict by using subplots

Parameters:
  • shape (tuple) – Layout for the subplots (e.g. (2, 2))
  • truncate (int) – Only process the first n items in the corpkit.interrogation.Interrodict
  • kwargs (keyword arguments) – specifications to pass to plotter()
copy()[source]
flip(truncate=30, transpose=True, repeat=False, *args, **kwargs)[source]

Change the dimensions of corpkit.interrogation.Interrodict, making column names into keys.

Parameters:
  • truncate (int/‘all’) – Get first n columns
  • transpose (bool) – Flip rows and columns
  • repeat (bool) – Flip twice, to move columns into key position
  • kwargs – Arguments to pass to the edit() method
Returns:

corpkit.interrogation.Interrodict

get_totals()[source]

Helper function to concatenate all totals

Concordance

class corpkit.interrogation.Concordance(data)[source]

Bases: pandas.core.frame.DataFrame

A class for concordance lines, with methods for saving, formatting and editing.

format(kind='string', n=100, window=35, print_it=True, columns='all', metadata=True, **kwargs)[source]

Print concordance lines nicely, to string, LaTeX or CSV

Parameters:
  • kind (str) – output format: string/latex/csv
  • n (int/‘all’) – Print first n lines only
  • window (int) – how many characters to show to left and right
  • columns (list) – which columns to show
Example:
>>> lines = corpus.concordance({T: r'/NN.?/ >># NP'}, show=L)
### show 25 characters either side, 4 lines, just text columns
>>> lines.format(window=25, n=4, columns=[L,M,R])
    0                  we 're in  tucson     , then up north to flagst
    1  e 're in tucson , then up  north      to flagstaff , then we we
    2  tucson , then up north to  flagstaff  , then we went through th
    3   through the grand canyon  area       and then phoenix and i sp
Returns:None
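
The effect of window and n can be illustrated on a plain DataFrame. The sketch below assumes the left, match and right text live in columns named l, m and r, and uses a hypothetical format_lines helper; it handles the plain-string case only and ignores metadata columns.

```python
import pandas as pd

# Assumed column names: l (left context), m (match), r (right context)
lines = pd.DataFrame(
    {
        "l": ["we 're in", "e 're in tucson , then up"],
        "m": ["tucson", "north"],
        "r": [", then up north to flagstaff", "to flagstaff , then we went"],
    }
)

def format_lines(df, window=35, n=100):
    """Return aligned 'index  left  match  right' strings for the first n rows."""
    subset = df.head(n)
    mid_width = int(subset["m"].str.len().max())
    out = []
    for i, row in subset.iterrows():
        # Keep the last `window` characters on the left, right-aligned
        left = row["l"][-window:].rjust(window)
        # Keep the first `window` characters on the right
        right = row["r"][:window]
        out.append(f"{i}  {left}  {row['m'].ljust(mid_width)}  {right}")
    return out

formatted = format_lines(lines, window=10, n=2)
print("\n".join(formatted))
```

Right-aligning the left context is what keeps the matched words in a single column.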
calculate()[source]

Make new Interrogation object from (modified) concordance lines

shuffle(inplace=False)[source]

Shuffle concordance lines

Parameters:inplace (bool) – Modify current object, or create a new one
Example:
>>> lines[:4].shuffle()
    3  01  1-01.txt.conll   through the grand canyon  area       and then phoenix and i sp
    1  01  1-01.txt.conll  e 're in tucson , then up  north      to flagstaff , then we we
    0  01  1-01.txt.conll                  we 're in  tucson     , then up north to flagst
    2  01  1-01.txt.conll  tucson , then up north to  flagstaff  , then we went through th
edit(*args, **kwargs)[source]

Delete or keep rows by subcorpus or by middle column text.

Example:
>>> skipped = conc.edit(skip_entries=r'to_?match')
less(**kwargs)[source]