c 从 pdf 中提取文本和图像-k8凯发天生赢家

c 从 pdf 中提取文本和图像

从 pdf 文件中提取文本和图像让您能够在其他类型的文件（例如 word 文档、网页或演示文稿）中快速重复使用这些内容。这种方法可以帮助您节省大量时间和精力，因为它省略了从头开始重新键入信息的繁琐且耗时的过程。在本文中，您将学习如何使用 spire.pdf for c 从 pdf 文件中提取文本和图像。

安装 spire.pdf for c

有两种方法可以将 spire.pdf for c 集成到您的应用程序中。一种方法是通过安装它，另一种方法是从我们的网站下载包并将库复制到您的程序中。通过 nuget 安装更简单，更推荐使用。您可以通过访问以下链接找到更多详细信息。

如何将 spire.pdf for c 集成到 c 程序中

从 pdf 中提取文本

spire.pdf for c 提供了 pdfpagebase->extracttext() 方法，使您能够从 pdf 文件中的页面中提取文本。具体步骤如下：

初始化 pdfdocument 类的实例。
使用 pdfdocument->loadfromfile() 方法加载 pdf 文件。
遍历文件中的所有页面。
使用 pdfpagebase->extracttext() 方法从页面中提取文本。
将提取的文本保存为 txt 文件。

#include "spire.pdf.o.h"
#include 
#include 
using namespace spire::pdf;
using namespace std;
int main()
{
	//初始化pdfdocument类的实例
	pdfdocument* doc = new pdfdocument();
	//加载pdf文件
	doc->loadfromfile(l"极昼极夜是怎么形成的.pdf");
	wstring buffer = l"";
	//遍历文件中的所有页面
	for (int i = 0; i < doc->getpages()->getcount(); i  )
	{
		pdfpagebase* page = doc->getpages()->getitem(i);
		//从页面中提取文本
		buffer  = (page->extracttext());
	}
	//将提取的文本保存为txt文件
	wofstream write(l"提取文本.txt");
	auto locutf8 = locale(locale(""), new std::codecvt_utf8);
	write.imbue(locutf8);
	write << buffer;
	write.close();
	doc->close();
	delete doc;
}

c 从 pdf 中提取文本和图像

从 pdf 中的特定页面区域提取文本

您可以使用 page->extracttext(rectanglef* rectanglef) 方法从 pdf 页面的特定矩形区域提取文本。具体步骤如下：

初始化 pdfdocument 类的实例。
使用 pdfdocument->loadfromfile() 方法加载 pdf 文件。
使用 pdfdocument->getpages()->getitem(int index) 方法通过索引获取特定页面。
使用 page->extracttext(rectanglef* rectanglef) 方法从页面的特定矩形区域提取文本。
将提取的文本保存为 txt 文件。

#include "spire.pdf.o.h"
#include 
#include 
using namespace spire::pdf;
using namespace std;
int main()
{
	//初始化pdfdocument类的实例
	pdfdocument* doc = new pdfdocument();
	//加载pdf文件
	doc->loadfromfile(l"极昼极夜是怎么形成的.pdf");
	//获取指定页面
	pdfpagebase* page = doc->getpages()->getitem(0);
	//从页面中的特定矩形区域提取文本
	wstring text = page->extracttext(new rectanglef(0, 0, 600, 200));
	//将提取的文本保存为txt文件
	wofstream write(l"从指定区域提取文本.txt");
	auto locutf8 = locale(locale(""), new std::codecvt_utf8);
	write.imbue(locutf8);
	write << text;
	write.close();
	doc->close();
	delete doc;
}

c 从 pdf 中提取文本和图像

从 pdf 中提取图像

您可以使用 pdfpagebase->extractimages() 方法从 pdf 文件中的页面中提取图像。具体步骤如下：

初始化 pdfdocument 类的实例。
使用 pdfdocument->loadfromfile() 方法加载 pdf 文件。
遍历文件中的所有页面。
使用 pdfpagebase->extractimages() 方法从页面中提取图像。
将提取的图像保存为 png 文件。

#include "spire.pdf.o.h"
#include 
#include 
using namespace spire::pdf;
using namespace std;
int main()
{
	//初始化pdfdocument类的实例
	pdfdocument* doc = new pdfdocument();
	//加载pdf文件
	doc->loadfromfile(l"我国养老金问题如何缓解.pdf");
	int index = 0;
	//遍历文件中的所有页面
	for (int i = 0; i < doc->getpages()->getcount(); i  )
	{
		pdfpagebase* page = doc->getpages()->getitem(i);
		//从页面中提取图像
		for (auto image : page->extractimages())
		{
			std::wstring imagefilename = l"images\\image-"   to_wstring(index)   l".png";
			image->save(imagefilename.c_str(), imageformat::getpng());
			index  ;
		}
	}
	doc->close();
	delete doc;
}

c 从 pdf 中提取文本和图像

申请临时 license

如果您希望删除结果文档中的评估消息，或者摆脱功能限制，请该email地址已收到反垃圾邮件插件保护。要显示它您需要在浏览器中启用javascript。获取有效期 30 天的临时许可证。