Home | 简体中文 | 繁体中文 | 杂文 | Github | 知乎专栏 | 51CTO学院 | CSDN程序员研修院 | OSChina 博客 | 腾讯云社区 | 阿里云栖社区 | Facebook | Linkedin | Youtube | 打赏(Donations) | About
知乎专栏多维度架构

第 24 章 办公自动化

目录

24.1. Python 处理 PDF 文件
24.1.1. Word 转 PDF
24.1.2. 提取 PDF 文件中的文字和表格
24.1.3. PyPDF2
24.2. Word 文字处理
24.2.1. 安装
24.2.2. 创建空白文档
24.2.3. 添加标题
24.2.4. 添加段落
24.2.5. 列表
24.2.6. 表格
24.2.7. 添加图片
24.2.8. 强制分页
24.2.9. 样式
24.2.10. 演示例子
24.2.11. 另存操作
24.2.12. 读取 Word 文档
24.2.13. Word 模版合并
24.3. Python 处理 Excel
24.3.1. openpyxl - A Python library to read/write Excel 2010 xlsx/xlsm files
24.3.2. xlrd/xlwt/xlutils
24.3.3. xlwings

24.1. Python 处理 PDF 文件

24.1.1. Word 转 PDF

安装

		
pip install docx2pdf		
		
		

macOS 还需要安装

		
brew install aljohri/-/docx2pdf
		
		

命令行演示

查看帮助信息 docx2pdf --help

			
neo@MacBook-Pro-Neo ~ % docx2pdf
usage: docx2pdf [-h] [--keep-active] [--version] input [output]

Example Usage:

Convert single docx file in-place from myfile.docx to myfile.pdf:
    docx2pdf myfile.docx

Batch convert docx folder in-place. Output PDFs will go in the same folder:
    docx2pdf myfolder/

Convert single docx file with explicit output filepath:
    docx2pdf input.docx output.docx

Convert single docx file and output to a different explicit folder:
    docx2pdf input.docx output_dir/

Batch convert docx folder. Output PDFs will go to a different explicit folder:
    docx2pdf input_dir/ output_dir/

positional arguments:
  input          input file or folder. batch converts entire folder or convert single file
  output         output file or folder

optional arguments:
  -h, --help     show this help message and exit
  --keep-active  prevent closing word after conversion
  --version      display version and exit
		
		

代码演示

		
from docx2pdf import convert

convert("input.docx")
convert("input.docx", "output.pdf")
convert("my_docx_folder/")
		
		

如果只是转换一个文档,我们就没有必要用Python了。下面的程序是批量转换指定目录下的 Word 文档。

		
import os
import glob
from pathlib import Path

# 指定转换目录
path = os.getcwd() + '/' 

p = Path(path) 
files=list(p.glob("**/*.docx")) 

for file in files:
    convert(file,f"{file}.pdf")		
		
		

24.1.2. 提取 PDF 文件中的文字和表格

Plumb a PDF for detailed information about each char, rectangle, and line.

24.1.2.1. 安装 pdfplumber

		
neo@MacBook-Pro-Neo ~/workspace/python % pip install pdfplumber		
		
			

查看 pdfplumber 是否安装成功

		
neo@MacBook-Pro-Neo ~/workspace/python % pip show pdfplumber
Name: pdfplumber
Version: 0.5.26
Summary: Plumb a PDF for detailed information about each char, rectangle, and line.
Home-page: https://github.com/jsvine/pdfplumber
Author: Jeremy Singer-Vine
Author-email: jsvine@gmail.com
License: UNKNOWN
Location: /Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages
Requires: pdfminer.six, Pillow, Wand
Required-by: 		
		
			

24.1.2.2. 获取PDF文档信息

		
import os,pdfplumber
import pandas as pd

file = os.path.expanduser("~/tmp/每日开放式基金净值表.pdf")

with pdfplumber.open(file) as pdf:
    print(pdf.metadata)		
		
			

输出结果

		
{'Producer': 'macOS 版本11.2.1(版号20D74) Quartz PDFContext', 'CreationDate': "D:20210227145013Z00'00'", 'ModDate': "D:20210227145013Z00'00'"}		
		
			

24.1.2.3. 获取PDF总页数

		
import os,pdfplumber
import pandas as pd

file = os.path.expanduser("~/tmp/每日开放式基金净值表.pdf")

with pdfplumber.open(file) as pdf:
    print(len(pdf.pages))		
		
			

24.1.2.4. 查看PDF页面信息

		
import os,pdfplumber
import pandas as pd

file = os.path.expanduser("~/tmp/每日开放式基金净值表.pdf")

with pdfplumber.open(file) as pdf:

    first_page = pdf.pages[0]
    
    # 查看当前页码
    print('页码:',first_page.page_number)

    # 查看当前页宽
    print('页宽:',first_page.width)

    # 查看当前页高
    print('页高:',first_page.height)		
		
			

输出结果

		
neo@MacBook-Pro-Neo ~/workspace/python % python3 pdf.py
页码: 1
页宽: 1324
页高: 7638		
		
			

24.1.2.5. 提取文本内容

		
import os,pdfplumber
import pandas as pd

file = os.path.expanduser("~/tmp/每日开放式基金净值表.pdf")

with pdfplumber.open(file) as pdf:

    first_page = pdf.pages[0]
    
    # 读取文本
    text = first_page.extract_text()
    print(text)		
		
			

24.1.2.6. 提取pdf中的表格数据

		
import os,pdfplumber
import pandas as pd

file = os.path.expanduser("~/tmp/每日开放式基金净值表.pdf")

with pdfplumber.open(file) as pdf:
    first_page = pdf.pages[0]
    table = first_page.extract_table()
    print(table)	
		
			

		
import os,pdfplumber
import pandas as pd

file = os.path.expanduser("~/tmp/每日开放式基金净值表.pdf")

with pdfplumber.open(file) as pdf:

    first_page = pdf.pages[0]
    table = first_page.extract_table()
    # print(table)
    for t in table:
        print(t)		
		
			
		
neo@MacBook-Pro-Neo ~/workspace/python % python3 pdf.py
['关注', '⽐较', '序号', '基⾦代码', '基⾦简称', '2021-02-26', None, '2021-02-25', None, '⽇增长值', '⽇增长率', '申购状态', '赎回状态', '⼿续费']
[None, None, None, None, None, '单位净值', '累计净值', '单位净值', '累计净值', None, None, None, None, None]
['', '', '1', '501030', '汇添富中证环境治理指数A 估值图 基⾦吧', '0.5501', '0.5501', '0.5421', '0.5421', '0.0080', '1.48%', '开放', '开放', '0.12%']
['', '', '2', '501031', '汇添富中证环境治理指数C 估值图 基⾦吧', '0.5471', '0.5471', '0.5392', '0.5392', '0.0079', '1.47%', '开放', '开放', '0.00%']
['', '', '3', '164908', '交银中证环境治理(LOF) 估值图 基⾦吧', '0.4890', '0.4890', '0.4820', '0.4820', '0.0070', '1.45%', '开放', '开放', '0.12%']
['', '', '4', '004005', '东⽅民丰回报赢安混合A 估值图 基⾦吧', '1.0564', '1.0709', '1.0438', '1.0583', '0.0126', '1.21%', '开放', '开放', '0.06%']
['', '', '5', '004006', '东⽅民丰回报赢安混合C 估值图 基⾦吧', '1.0463', '1.0593', '1.0338', '1.0468', '0.0125', '1.21%', '开放', '开放', '0.00%']		
		
			

24.1.2.7. 保存数据到 Excel

		
neo@MacBook-Pro-Neo ~/workspace/python % pip install openpyxl		
		
			
		
import os,pdfplumber

import pandas as pd

file = os.path.expanduser("~/tmp/每日开放式基金净值表.pdf")

# 读取pdf文件,保存为excel
with pdfplumber.open(file) as pdf:
    first_page = pdf.pages[0]

    # 自动读取表格信息,返回列表
    table = first_page.extract_table()
    # print(table)

    save = pd.DataFrame(table[2:],columns=table[0])
    # 保存excel
    save.to_excel('output.xlsx')		
		
			

24.1.3. PyPDF2