<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<?xml-stylesheet type="text/xsl" href="blog.xsl"?>
<article>
 <title>How to Build an Image Extraction Webapp using Java</title>
 <subtitle/>
 <excerpt>Our second image extraction article, this time building a web-app around the Java program.</excerpt>
 <time>2016-12-02T15:44:06</time>
 <author>leo</author>
 <category>pdf</category>
 <tags>bitmap extraction image</tags>
 <body>
  <p>
   In this post we will take the code from our previous
   <a href="/blog/2016/11/24/extracting_images_from_pdf_with_the_bfo_pdf_library/">Image Extraction post</a>
   to build a complete web service/web-app for
   uploading PDFs and also for downloading extracted images.
  </p>

  <h2>A few modifications to the Java program</h2>
  <p>
   If you compare the <a viewtext="true" href="SimpleExtractor.java">Java code</a> from this
   blog post to the
   <a href="/blog/2016/11/24/extracting_images_from_pdf_with_the_bfo_pdf_library/">first post</a>,
   you will spot a few differences. First, the
   “extraction” code has been shaped into an object. Second, the rendering part has
   been placed into an enum. Finally, the main method has been polished to offer some
   command line options, such as the source PDF, the output path and
   which bitmap format(s) should be used.
  </p>
  <p>
   I also enforced the use of a Java Logger, which is always good to have. It
   keeps the console output clean, making it easy to parse from the
   code that calls the Java program.  Another important point is that our
   program should call <code>System.exit</code> to
   signal how it completed. I am sure some of you remember the good
   old Windows 98 days and error messages stating “system error
   code 8206”.  Any exit code other than 0 indicates an error. In our case, I defined two
   error codes:
  </p>
  <ol>
   <li>The program was not called correctly: some parameters were missing</li>
   <li>The program could not read the given PDF</li>
  </ol>
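  <p>
   On the calling side, which is written in JavaScript later in this post, these
   codes can be turned into readable messages. The following is only a minimal
   sketch: <code>EXIT_MESSAGES</code> and <code>describeExitCode</code> are
   hypothetical names, and the numeric values 1 and 2 are an assumption based on
   the order of the list above, so check them against the Java source before
   relying on them.
  </p>

```javascript
// Hypothetical mapping; the values 1 and 2 are assumed from the order of the
// list above and are not confirmed by SimpleExtractor.java.
const EXIT_MESSAGES = {
  0: 'extraction succeeded',
  1: 'the program was not called correctly (missing parameters)',
  2: 'the program could not read the given PDF',
};

// Turn an exit code into a message suitable for a log line or an HTTP response.
function describeExitCode(code) {
  if (Object.prototype.hasOwnProperty.call(EXIT_MESSAGES, code)) {
    return EXIT_MESSAGES[code];
  }
  // Anything else (including null, when the process was killed by a signal).
  return `unexpected exit code ${code}`;
}
```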

  <h2>Using the program we built to create a web-app</h2>
  <p>
   I went for a JS stack for this web-app, with Node.js/Express in the
   back-end, Vue.js for the front-end and
   <a href="https://getmdl.io" target= "_blank" rel="nofollow">Material Design Lite</a>
   for the UI.  Let us keep the debate about which tech stack is best for
   another time!  Whichever technologies you use, the steps are going
   to be the same:
  </p>
  <ul>
   <li>The user uploads a PDF to your server</li>
   <li>The server passes it along to the Java program</li>
   <li>BFO does its work</li>
   <li>The server offers the user a way to download the extracted images</li>
  </ul>
  <h3>Key Points with Express</h3>
  <p>
   Express does not parse file uploads on its own, so before pressing ahead we
   need a way to receive the PDF files. For that we will use the well-known
   <code>multer</code> module.
  </p>
  <pre class="brush:javascript">
   // require the module
   var multer = require("multer");
   // create a middleware-helper for the upload
   var upload = multer({ dest: "pdfs/" });
  </pre>
  <p>
   Once initialized, this module offers a lot of
   <a href="https://github.com/expressjs/multer" target= "_blank" rel="nofollow">possibilities</a>.
   Here we use <code>upload.single</code> as a middleware to handle the upload of one file:
  </p>
  <pre class="brush:javascript">
   router.post("/extract", upload.single("pdfFile"), function (req, res) {
       // the uploaded file is now available as req.file
       let pdfFilePath = req.file.path;
       console.log(`the file was uploaded to ${pdfFilePath}`);
   });
  </pre>
  <p>
   Another useful module is
   <a href="https://github.com/jprichardson/node-fs-extra" target= "_blank" rel="nofollow"><code>fs-extra</code></a>,
   which makes it easier to create and delete files and folders than the
   built-in <code>fs</code> module of Node.js.
   With it, we will be able to clear entire directories once the users are
   finished with the images.
  </p>
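  <p>
   As a rough sketch of that clean-up step: recent Node.js versions (14.14 and
   later) also ship a recursive remove in the built-in <code>fs</code> module, so
   the example below swaps that in for <code>fs-extra</code> to stay free of
   dependencies. <code>clearDirectory</code> is a hypothetical helper name, not
   something taken from the project.
  </p>

```javascript
const fs = require('fs');

// Remove a directory and everything inside it.
// `recursive: true` deletes the contents as well; `force: true` means a
// directory that is already gone is not treated as an error.
function clearDirectory(dir) {
  fs.rmSync(dir, { recursive: true, force: true });
}
```

  <p>
   Calling <code>clearDirectory</code> on the folder holding one user's extracted
   images removes the folder itself too, so it has to be created again before the
   next extraction writes into it.
  </p>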
  <p>
   However, the most important part of the work on the server side is to call
   the Java extraction program and to harvest the results.  We will use the
   <code>spawn</code> method from the <code>child_process</code> module, which
   lets us launch an external command from Node (directly, without going through a shell):
  </p>
  <pre class="brush:javascript">
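    // spawn comes from Node's built-in child_process module
    const { spawn } = require("child_process");

    let child = spawn(
      'java',
      [
        '-jar',
        './jars/ImageExtractor.jar',
        // put here the rest of the parameters (pdf file path, type of picture to create, output path to use)
      ]
    );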
  </pre>
  <p>Once the process has started, we can monitor it by placing listeners:</p>
  <pre class="brush:javascript">
     child.stdout.on('data', (data) => {
       // the child process wrote some text to its standard output (e.g. System.out.print())
       console.log(`ok data => '${data}'`);
     });

     child.stderr.on('data', (data) => {
       // the process sent out some text on the standard error output (e.g. System.err.println())
       console.log(`err data => '${data}'`);
     });

     child.on('close', (code) => {
       // the process finished; code is its exit code (the value passed to System.exit)
       if (code === 0) {
         // it worked!
       } else {
         // there was some error
       }
     });
  </pre>
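  <p>
   The three listeners can also be folded into a single promise, which tends to
   be easier to use inside a route handler. The following is a minimal sketch,
   not the code from the project: <code>runProcess</code> is a hypothetical
   helper name, and it adds an <code>error</code> listener for the case where the
   command cannot be started at all.
  </p>

```javascript
const { spawn } = require('child_process');

// Run a command and resolve with its exit code and everything it printed.
// `runProcess` is a hypothetical helper, not part of the original project.
function runProcess(command, args) {
  return new Promise((resolve, reject) => {
    const child = spawn(command, args);
    let stdout = '';
    let stderr = '';

    // Decode as UTF-8 so each chunk arrives as a string rather than a Buffer.
    child.stdout.setEncoding('utf8');
    child.stderr.setEncoding('utf8');
    child.stdout.on('data', (data) => { stdout += data; });
    child.stderr.on('data', (data) => { stderr += data; });

    // 'error' fires when the process could not be started
    // (for example, the executable is not on the PATH).
    child.on('error', reject);

    // 'close' fires once the process has ended and its streams are closed.
    child.on('close', (code) => resolve({ code, stdout, stderr }));
  });
}
```

  <p>
   In the route handler, <code>runProcess</code> would be called with
   <code>'java'</code> and the same argument array as before, and the resolved
   <code>code</code> checked against 0 just like in the <code>close</code>
   listener.
  </p>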
  <p>
    You can check the
    <a href="https://github.com/leojpod/ImageExtractor/blob/master/web/routes/api.js" target="_blank">complete code in <code>api.js</code></a>
    to see how the pieces are put together to create the complete service. The whole project is also available in a
    <a href="https://github.com/leojpod/ImageExtractor/" target = "_blank">GitHub repository</a> for reference.
  </p>
     
  <b>Leo Jeusset</b><br/>
  Freelance developer and BFO guest blogger<br/>
  <a href="https://twitter.com/leojpod" target="_blank" rel="nofollow"> https://twitter.com/leojpod</a><br/>
 </body>
</article>