第 27 章办公自动化
上一页		下一页

第 27 章办公自动化

27.1. Python 处理 PDF 文件

27.1.1. Word 转 PDF

安装

		
pip install docx2pdf

macOS 还需要安装

		
brew install aljohri/-/docx2pdf

命令行演示

查看帮助信息 docx2pdf --help

			
neo@MacBook-Pro-Neo ~ % docx2pdf
usage: docx2pdf [-h] [--keep-active] [--version] input [output]

Example Usage:

Convert single docx file in-place from myfile.docx to myfile.pdf:
    docx2pdf myfile.docx

Batch convert docx folder in-place. Output PDFs will go in the same folder:
    docx2pdf myfolder/

Convert single docx file with explicit output filepath:
    docx2pdf input.docx output.docx

Convert single docx file and output to a different explicit folder:
    docx2pdf input.docx output_dir/

Batch convert docx folder. Output PDFs will go to a different explicit folder:
    docx2pdf input_dir/ output_dir/

positional arguments:
  input          input file or folder. batch converts entire folder or convert single file
  output         output file or folder

optional arguments:
  -h, --help     show this help message and exit
  --keep-active  prevent closing word after conversion
  --version      display version and exit

代码演示

		
from docx2pdf import convert

convert("input.docx")
convert("input.docx", "output.pdf")
convert("my_docx_folder/")

如果只是转换一个文档，我们就没有必要用Python了。下面的程序是批量转换指定目录下的 Word 文档。

		
import os
import glob
from pathlib import Path

# 指定转换目录
path = os.getcwd() + '/' 

p = Path(path) 
files=list(p.glob("**/*.docx")) 

for file in files:
    convert(file,f"{file}.pdf")

27.1.2. 提取 PDF 文件中的文字和表格

Plumb a PDF for detailed information about each char, rectangle, and line.

27.1.2.1. 安装 pdfplumber

		
neo@MacBook-Pro-Neo ~/workspace/python % pip install pdfplumber

查看 pdfplumber 是否安装成功

		
neo@MacBook-Pro-Neo ~/workspace/python % pip show pdfplumber
Name: pdfplumber
Version: 0.5.26
Summary: Plumb a PDF for detailed information about each char, rectangle, and line.
Home-page: https://github.com/jsvine/pdfplumber
Author: Jeremy Singer-Vine
Author-email: jsvine@gmail.com
License: UNKNOWN
Location: /Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages
Requires: pdfminer.six, Pillow, Wand
Required-by:

27.1.2.2. 获取PDF文档信息

		
import os,pdfplumber
import pandas as pd

file = os.path.expanduser("~/tmp/每日开放式基金净值表.pdf")

with pdfplumber.open(file) as pdf:
    print(pdf.metadata)

输出结果

		
{'Producer': 'macOS 版本11.2.1（版号20D74） Quartz PDFContext', 'CreationDate': "D:20210227145013Z00'00'", 'ModDate': "D:20210227145013Z00'00'"}

27.1.2.3. 获取PDF总页数

		
import os,pdfplumber
import pandas as pd

file = os.path.expanduser("~/tmp/每日开放式基金净值表.pdf")

with pdfplumber.open(file) as pdf:
    print(len(pdf.pages))

27.1.2.4. 查看PDF页面信息

		
import os,pdfplumber
import pandas as pd

file = os.path.expanduser("~/tmp/每日开放式基金净值表.pdf")

with pdfplumber.open(file) as pdf:

    first_page = pdf.pages[0]
    
    # 查看当前页码
    print('页码：',first_page.page_number)

    # 查看当前页宽
    print('页宽：',first_page.width)

    # 查看当前页高
    print('页高：',first_page.height)

输出结果

		
neo@MacBook-Pro-Neo ~/workspace/python % python3 pdf.py
页码： 1
页宽： 1324
页高： 7638

27.1.2.5. 提取文本内容

		
import os,pdfplumber
import pandas as pd

file = os.path.expanduser("~/tmp/每日开放式基金净值表.pdf")

with pdfplumber.open(file) as pdf:

    first_page = pdf.pages[0]
    
    # 读取文本
    text = first_page.extract_text()
    print(text)

27.1.2.6. 提取pdf中的表格数据

		
import os,pdfplumber
import pandas as pd

file = os.path.expanduser("~/tmp/每日开放式基金净值表.pdf")

with pdfplumber.open(file) as pdf:
    first_page = pdf.pages[0]
    table = first_page.extract_table()
    print(table)

		
import os,pdfplumber
import pandas as pd

file = os.path.expanduser("~/tmp/每日开放式基金净值表.pdf")

with pdfplumber.open(file) as pdf:

    first_page = pdf.pages[0]
    table = first_page.extract_table()
    # print(table)
    for t in table:
        print(t)

		
neo@MacBook-Pro-Neo ~/workspace/python % python3 pdf.py
['关注', '⽐较', '序号', '基⾦代码', '基⾦简称', '2021-02-26', None, '2021-02-25', None, '⽇增长值', '⽇增长率', '申购状态', '赎回状态', '⼿续费']
[None, None, None, None, None, '单位净值', '累计净值', '单位净值', '累计净值', None, None, None, None, None]
['', '', '1', '501030', '汇添富中证环境治理指数A 估值图 基⾦吧', '0.5501', '0.5501', '0.5421', '0.5421', '0.0080', '1.48%', '开放', '开放', '0.12%']
['', '', '2', '501031', '汇添富中证环境治理指数C 估值图 基⾦吧', '0.5471', '0.5471', '0.5392', '0.5392', '0.0079', '1.47%', '开放', '开放', '0.00%']
['', '', '3', '164908', '交银中证环境治理(LOF) 估值图 基⾦吧', '0.4890', '0.4890', '0.4820', '0.4820', '0.0070', '1.45%', '开放', '开放', '0.12%']
['', '', '4', '004005', '东⽅民丰回报赢安混合A 估值图 基⾦吧', '1.0564', '1.0709', '1.0438', '1.0583', '0.0126', '1.21%', '开放', '开放', '0.06%']
['', '', '5', '004006', '东⽅民丰回报赢安混合C 估值图 基⾦吧', '1.0463', '1.0593', '1.0338', '1.0468', '0.0125', '1.21%', '开放', '开放', '0.00%']

27.1.2.7. 保存数据到 Excel

		
neo@MacBook-Pro-Neo ~/workspace/python % pip install openpyxl

		
import os,pdfplumber

import pandas as pd

file = os.path.expanduser("~/tmp/每日开放式基金净值表.pdf")

# 读取pdf文件，保存为excel
with pdfplumber.open(file) as pdf:
    first_page = pdf.pages[0]

    # 自动读取表格信息，返回列表
    table = first_page.extract_table()
    # print(table)

    save = pd.DataFrame(table[2:],columns=table[0])
    # 保存excel
    save.to_excel('output.xlsx')

上一页		下一页
26.4. 容器	起始页	27.2. Word 文字处理

第 27 章 办公自动化

27.1. Python 处理 PDF 文件

27.1.1. Word 转 PDF

27.1.2. 提取 PDF 文件中的文字和表格

Plumb a PDF for detailed information about each char, rectangle, and line.

27.1.2.1. 安装 pdfplumber

27.1.2.2. 获取PDF文档信息

27.1.2.3. 获取PDF总页数

27.1.2.4. 查看PDF页面信息

27.1.2.5. 提取文本内容

27.1.2.6. 提取pdf中的表格数据

27.1.2.7. 保存数据到 Excel

27.1.3. PyPDF2

第 27 章办公自动化