code package

Submodules

code.browser_output module

browser_output.py

code.browser_output.content_formatter(lines)

Returns the browser output and opens it in the system's default browser.

Parameters

lines – The result file contents line by line

Returns

The browser output in HTML form

code.browser_output.output_formatter()

Returns the browser output and opens it in the system's default browser.

Returns

The browser output in HTML form

code.browser_output.result_display(content, wordcloud_image_name)

Returns the browser output and opens it in the system's default browser.

Parameters
  • content – The result file contents in HTML form

  • wordcloud_image_name – The name for word cloud image

Returns

The browser output in HTML form
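The write-then-open pattern described for these functions might look like the sketch below. It assumes the HTML is written to a temporary file and opened with the standard-library `webbrowser` module. The helper names `build_page` and `show_in_browser` and the page layout are hypothetical and are not taken from browser_output.py.

```python
import html
import tempfile
import webbrowser
from pathlib import Path


def build_page(content: str, wordcloud_image_name: str) -> str:
    """Wrap already-formatted HTML content and a word cloud image in a page."""
    # Escape the image name so it is safe inside an attribute value.
    image = html.escape(wordcloud_image_name, quote=True)
    return (
        "<!DOCTYPE html>\n"
        "<html><body>\n"
        f"{content}\n"
        f'<img src="{image}" alt="Word cloud">\n'
        "</body></html>\n"
    )


def show_in_browser(page: str, open_browser: bool = True) -> Path:
    """Write the page to a temporary .html file and optionally open it."""
    with tempfile.NamedTemporaryFile(
        "w", suffix=".html", delete=False, encoding="utf-8"
    ) as handle:
        handle.write(page)
    path = Path(handle.name)
    if open_browser:
        # webbrowser.open hands the file URI to the system's default browser.
        webbrowser.open(path.as_uri())
    return path
```

Passing `open_browser=False` keeps the sketch usable in headless environments, where only the written file is needed.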

code.extract_sizes module

Module completing step 2: given a PDF document, returns a dictionary of headers and paragraphs.

code.extract_sizes.extract_words(file: str) → dict

Given a filename, opens the PDF and extracts words and metadata from each slide.

Parameters

file (str) – String representing the file path

Return type

dict

Returns

dictionary representing document metadata and words extracted from each slide

code.extract_sizes.get_sizes(doc: dict) → list

Helper function to get unique sizes within a PDF

Parameters

doc (list) – The list of blocks within a PDF

Return type

list

Returns

a list of unique font sizes
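As an illustration of collecting unique sizes, here is a minimal sketch. It assumes each block follows the nested layout used by PyMuPDF's "dict" text extraction (blocks contain "lines", lines contain "spans", and each span has a "size"). That layout and the name `unique_sizes` are assumptions; the source does not state which structure `get_sizes` receives.

```python
def unique_sizes(blocks: list) -> list:
    """Collect the distinct span font sizes, largest first."""
    sizes = set()
    for block in blocks:
        # Blocks without text (for example images) have no "lines" key.
        for line in block.get("lines", []):
            for span in line.get("spans", []):
                sizes.add(span["size"])
    return sorted(sizes, reverse=True)


# Hypothetical sample data in the assumed layout.
sample_blocks = [
    {"lines": [{"spans": [{"size": 24.0, "text": "Title"}]}]},
    {"lines": [{"spans": [{"size": 12.0, "text": "Body"},
                          {"size": 24.0, "text": "Again"}]}]},
    {"type": 1},  # a block with no text lines
]
print(unique_sizes(sample_blocks))
```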

code.extract_sizes.tag_text(unique_fonts: list, doc: dict) → list

Categorizes each piece of text as either Heading or Paragraph. Heading covers the two largest font sizes (the title and the main heading). Paragraph covers all other sizes.

Parameters
  • unique_fonts (list) – a list of unique font sizes in the PowerPoint

  • doc (dict) – a list of blocks for each document page

Return type

list

Returns

a list of dictionaries categorizing each text into its respective category
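The "top 2 sizes are headings" rule can be sketched as follows. The span layout (dictionaries with "text" and "size" keys), the output keys, and the name `tag_by_size` are hypothetical simplifications, not the actual structures used by `tag_text`.

```python
def tag_by_size(unique_sizes: list, spans: list) -> list:
    """Tag each span as Heading (two largest sizes) or Paragraph."""
    # The two largest sizes are treated as title and main heading.
    heading_sizes = set(sorted(unique_sizes, reverse=True)[:2])
    tagged = []
    for span in spans:
        category = "Heading" if span["size"] in heading_sizes else "Paragraph"
        tagged.append({"category": category, "text": span["text"]})
    return tagged


# Hypothetical sample spans.
sample_spans = [
    {"text": "Intro", "size": 24.0},
    {"text": "Goals", "size": 18.0},
    {"text": "Details", "size": 12.0},
]
print(tag_by_size([12.0, 18.0, 24.0], sample_spans))
```

One consequence of this rule: a document with only two distinct font sizes would have every span tagged as Heading.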

code.extract_sizes.text_to_groupings(doc: dict) → list

Given a PDF document, returns a dictionary of headers, paragraphs, and page numbers

Parameters

doc (dict) – a PDF document containing only words

Return type

list

Returns

a dictionary categorizing each text into its respective category

code.user_cli module

user_cli.py

code.user_cli.generate_wordcloud(data: list, file_name: str) → None

Given the keywords of a document, displays a word cloud.

Parameters
  • data (list) – List of cleaned keywords in a document

  • file_name (str) – The name of the lecture document

Return type

None

Returns

None

code.user_cli.user_menu()

Runner function. Prompts the user for input and produces a txt file of results.

code.wordprocessing module

wordprocessing.py

code.wordprocessing.construct_search_query(data: list) → list

Constructs a search query from the given PDF data

Parameters

data (list) – The list of data

Returns

List of words to search

Return type

list

code.wordprocessing.duplicate_word_removal(data: list) → list

Function to remove duplicate words

Parameters

data (list) – The list of dictionaries of the form [{"Header": "", "Header_keywords": [], "Paragraph_keywords": [], "slides": [int]}]

Returns

The list of dictionaries with duplicate keywords removed, of the form [{"Header": "", "Header_keywords": [], "Paragraph_keywords": [], "slides": [int]}]

Return type

list
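A minimal sketch of duplicate removal on this structure is shown below. It assumes duplicates are removed within each keyword list separately and that first-occurrence order is kept. Both choices, and the name `remove_duplicate_keywords`, are assumptions rather than the documented behavior.

```python
def remove_duplicate_keywords(data: list) -> list:
    """Return copies of the entries with repeated keywords dropped."""
    cleaned = []
    for entry in data:
        updated = dict(entry)  # shallow copy so the input is not modified
        for key in ("Header_keywords", "Paragraph_keywords"):
            # dict.fromkeys keeps the first occurrence of each keyword, in order.
            updated[key] = list(dict.fromkeys(entry.get(key, [])))
        cleaned.append(updated)
    return cleaned


# Hypothetical sample data.
sample = [{
    "Header": "Intro",
    "Header_keywords": ["ai", "ai", "ml"],
    "Paragraph_keywords": ["data", "model", "data"],
    "slides": [1, 2],
}]
print(remove_duplicate_keywords(sample))
```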

code.wordprocessing.extract_noun_chunks(data: list) → list

Extracts noun chunks using spaCy

Parameters

data (list) – list of PDF data

Returns

list of data with nouns extracted

Return type

list

code.wordprocessing.keyword_extractor(data: list) → list

Function to extract keywords from the headers and paragraphs of slides

Parameters

data (list) – The list of dictionaries of the form [{"Header": "", "Paragraph": "", "slide": int}]

Returns

The list of dictionaries with keywords extracted, of the form [{"Header": "", "Paragraph": "", "Header_keywords": [], "Paragraph_keywords": [], "slide": int}]

Return type

list

code.wordprocessing.merge_slide_with_same_headers(data: list) → list

Function to merge slides with the same header.

Parameters

data (list) – The list of dictionaries of the form [{"Header": "", "Paragraph": "", "Header_keywords": [], "Paragraph_keywords": [], "slide": int}]

Returns

The list of dictionaries where slides containing the same header are merged, of the form [{"Header": "", "Header_keywords": [], "Paragraph_keywords": [], "slides": [int]}]

Return type

list
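The change of shape from a single "slide" number to a "slides" list can be sketched as below. This assumes keywords from merged slides are simply concatenated and that headers must match exactly. The name `merge_by_header` and these merge rules are hypothetical.

```python
def merge_by_header(data: list) -> list:
    """Group entries by header, collecting keywords and slide numbers."""
    merged = {}
    for entry in data:
        group = merged.setdefault(
            entry["Header"],
            {
                "Header": entry["Header"],
                "Header_keywords": [],
                "Paragraph_keywords": [],
                "slides": [],
            },
        )
        group["Header_keywords"].extend(entry["Header_keywords"])
        group["Paragraph_keywords"].extend(entry["Paragraph_keywords"])
        if entry["slide"] not in group["slides"]:
            group["slides"].append(entry["slide"])
    # Dicts keep insertion order, so headers appear in first-seen order.
    return list(merged.values())


# Hypothetical sample data.
slides = [
    {"Header": "Intro", "Paragraph": "a", "Header_keywords": ["intro"],
     "Paragraph_keywords": ["x"], "slide": 1},
    {"Header": "Intro", "Paragraph": "b", "Header_keywords": ["intro"],
     "Paragraph_keywords": ["y"], "slide": 2},
    {"Header": "Methods", "Paragraph": "c", "Header_keywords": ["methods"],
     "Paragraph_keywords": ["z"], "slide": 3},
]
print(merge_by_header(slides))
```

Concatenation leaves repeated keywords in place, which is consistent with a separate duplicate-removal step operating on the merged form.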

code.wordprocessing.merge_slide_with_same_slide_number(data: list) → list

Function to merge slides with the same slide number into a single one.

Parameters

data (list) – The list of dictionaries of the form [{"Header": "", "Paragraph": "", "Header_keywords": [], "Paragraph_keywords": [], "slide": int}]

Returns

The list of dictionaries where slides containing the same slide number are merged, of the form [{"Header": "", "Header_keywords": [], "Paragraph_keywords": [], "slide": int}]

Return type

list
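Merging by slide number might look like the sketch below. The source does not say how conflicting headers on one slide are resolved; this sketch assumes the first non-empty header wins and that results are ordered by slide number. The name `merge_by_slide` is hypothetical.

```python
def merge_by_slide(data: list) -> list:
    """Combine entries that share a slide number into one entry per slide."""
    merged = {}
    for entry in data:
        group = merged.setdefault(
            entry["slide"],
            {
                "Header": "",
                "Header_keywords": [],
                "Paragraph_keywords": [],
                "slide": entry["slide"],
            },
        )
        # Assumption: keep the first non-empty header seen for the slide.
        if not group["Header"]:
            group["Header"] = entry["Header"]
        group["Header_keywords"].extend(entry["Header_keywords"])
        group["Paragraph_keywords"].extend(entry["Paragraph_keywords"])
    return [merged[number] for number in sorted(merged)]


# Hypothetical sample data: slide 2 is split across two entries.
fragments = [
    {"Header": "Methods", "Paragraph": "a", "Header_keywords": ["methods"],
     "Paragraph_keywords": ["x"], "slide": 2},
    {"Header": "Intro", "Paragraph": "b", "Header_keywords": ["intro"],
     "Paragraph_keywords": ["y"], "slide": 1},
    {"Header": "", "Paragraph": "c", "Header_keywords": [],
     "Paragraph_keywords": ["z"], "slide": 2},
]
print(merge_by_slide(fragments))
```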

Module contents