Q: Google to text page number limit

Stack Overflow user

Asked 2021-06-03 13:05:31

1 answer · 288 views · 0 followers · 0 votes

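I am very new to Google Apps Script. I have some PDF files in a folder on Google Drive, and I am trying to convert each PDF to a Google Doc and extract specific text. The PDF has more than 200 pages, but the resulting Google Doc is limited to 80 pages. Is there a limit on the number of pages you can run OCR on? Or am I missing something?

My code is as follows: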

//##### GLOBALS #####

const FOLDER_ID = "1rlAL4WrnxQ6pEY2uOmzWA_csUIDdBjVK"; //Folder ID of all PDFs
const SS = "1XS_YUUdu9FK_bBumK3lFu9fU_M9w7NGydZqOzu9vTyE";//The spreadsheet ID

const SHEET = "Extracted"; // Sheet tab name

/*########################################################
 * Main run file: extracts student IDs and their sections
 * from multiple PDF documents.
 *
 * Displays a list of students and sections in a Google Sheet.
 */

function extractInfo(){
  // Get all PDF files in the folder:
  const folder = DriveApp.getFolderById(FOLDER_ID);
  const files = folder.getFilesByType("application/pdf");

  let allInfo = [];
  // Iterate through each file in the folder
  while(files.hasNext()){
    Logger.log('first call');
    const file = files.next();
    const fileID = file.getId();

    const doc = getTextFromPDF(fileID);
    const invDate = extractInvDate(doc.text);

    // Guard against a missing result when a file has no matches
    allInfo = allInfo.concat(invDate || []);

    // This logs 80, even though the PDF has more than 200 pages
    // with the required text (invoice date) on each page.
    Logger.log("Length of allInfo array: " + allInfo.length);
  }
  importToSpreadsheet(allInfo);
}


/*########################################################
 * Extracts the text from a PDF and stores it in memory.
 * Also extracts the file name.
 *
 * param {string} : fileID : file ID of the PDF that the text will be extracted from.
 *
 * returns {object} : Contains the file name and PDF text.
 *
 */
function getTextFromPDF(fileID) {
  var blob = DriveApp.getFileById(fileID).getBlob()
  var resource = {
    title: blob.getName(),
    mimeType: blob.getContentType()
  };
  var options = {
    ocr: true, 
    ocrLanguage: "en"
  };
  // Convert the pdf to a Google Doc with ocr.
  var file = Drive.Files.insert(resource, blob, options);

  // Get the text from the newly created document.
  var doc = DocumentApp.openById(file.id);
  var text = doc.getBody().getText();
  var title = doc.getName();
  
  // Delete the document once the text has been stored.
  Drive.Files.remove(doc.getId());
  
  return {
    name:title,
    text:text
  };
}


function extractInvDate(text){
  const regexp = /Invoice Date:/g; // commented out: \d{2}\/\d{2}\/\d{4}/gi
  // match() returns null when nothing matches, so fall back to an empty array.
  return text.match(regexp) || [];
}


function importToSpreadsheet(data){
  // getRange throws on a zero-row range, so skip empty results.
  if (data.length === 0) return;

  const sheet = SpreadsheetApp.openById(SS).getSheetByName(SHEET);
  const range = sheet.getRange(3, 1, data.length, 1);

  // Write all rows in one call instead of one setValue per cell.
  range.setValues(data.map(value => [value]));
  //range.sort([2,1]);
}
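
The commented-out pattern in extractInvDate suggests the real goal is the date that follows each "Invoice Date:" label, not the label itself. A minimal sketch of that in plain JavaScript, assuming the dates look like 03/06/2021 and sit directly after the label (both are assumptions, since the PDFs are not shown). The name extractInvoiceDates is hypothetical:

```javascript
// Hypothetical helper: returns every date that follows an "Invoice Date:" label.
// Assumes a dd/dd/dddd date format; adjust the pattern to the real invoices.
function extractInvoiceDates(text) {
  const regexp = /Invoice Date:\s*(\d{2}\/\d{2}\/\d{4})/gi;
  // matchAll yields one match per label; index 1 holds the captured date.
  // With no matches the result is simply an empty array, never null.
  return Array.from(text.matchAll(regexp), match => match[1]);
}
```

Counting the returned dates per file gives the same signal as the logged array length, so a file whose count stops at 80 still points at the conversion step and not at the regex.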

pdf

google-sheets

ocr

1 Answer

Stack Overflow user

Answered 2022-04-19 13:47:10

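The problem, or limitation, lies in the Drive.Files.insert function.

When you extract the blob you can get its contents as a string, but that string also contains MIME details, so you may need to process it. Sample code is below. Modify it to suit your needs.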

var blob =  DriveApp.getFileById(fileID).getBlob()
var txt = blob.getDataAsString()
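
If the raw string is available as shown above, one possible use is to estimate how many pages the PDF declares and compare that with the number of matches found in the converted document. A rough sketch in plain JavaScript, assuming the PDF stores its page objects uncompressed (PDFs that pack them into object streams will report 0). The helper countPdfPages is hypothetical and not part of any Google API:

```javascript
// Hypothetical helper: heuristic page count from the raw PDF source.
// Assumption: page objects are stored uncompressed in the file.
function countPdfPages(rawPdf) {
  // Each page object carries "/Type /Page"; the \b keeps "/Type /Pages"
  // (the page-tree node) from being counted.
  const matches = rawPdf.match(/\/Type\s*\/Page\b/g);
  return matches ? matches.length : 0;
}
```

A page count well above the number of extracted matches would suggest the text is being cut off during conversion and not missed by the regex.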

0 votes

Original page content provided by Stack Overflow.

Original link:

https://stackoverflow.com/questions/67822036
