Poster: tracey pooh Date: Jul 18, 2017 9:13am
Forum: forums Subject: Re: 'derive.php' task repeatedly reverts to 'waiting for admin'

well, sounds like rerunning isn't going to work.
can you paste a link or two to such items?
could be something up w/ the files or setup -- or something wrong or that we could improve w/ our processing...
thx!

Poster: TomQ Date: Jul 18, 2017 1:29pm
Forum: forums Subject: Re: 'derive.php' task repeatedly reverts to 'waiting for admin'

https://archive.org/details/el_zine_5
https://archive.org/details/san_quentin_news_2008
https://archive.org/details/hownikan_28.02
https://archive.org/details/kalihwisaks_2012
https://archive.org/details/motocross_performance_15.1
https://archive.org/details/2013.06

Here is a variety of examples with different challenges. In most cases, plenty of sibling issues derived with no trouble at all.

Poster: tracey pooh Date: Jul 18, 2017 5:03pm
Forum: forums Subject: Re: 'derive.php' task repeatedly reverts to 'waiting for admin'

thanks, helpful!
i'm more in the audio/video/TV end so less likely to help (unless something more _general_ going on w/ our processing).
most of these examples seem texts-ish so have let our books processing lead engr know..

sorry for the issues!

Poster: hank_b Date: Jul 18, 2017 6:32pm
Forum: forums Subject: Re: 'derive.php' task repeatedly reverts to 'waiting for admin'

It seems about half of those items have now derived successfully, due to intervention yesterday by Jeff.

Our code for deriving from PDF source files (i.e., making a stack of individual jp2 page images from the multi-page PDF) is brittle and frequently encounters PDFs that it can't handle. So a general suggestion: if you're able to upload text items as zips or tars of individual page images, there's a greater chance of a successful derive. See this old blog post for more info on uploading in those formats:

https://blog.archive.org/2012/05/24/uploading-images-for-text-items/

If you do need to upload PDFs, the most common cause of failure is the code trying to make too large a jp2 from one of the PDF pages, which in turn happens because it chose too high a ppi (pixels per inch) for the conversion. We rely on some rough heuristics to determine what ppi to use, and that choice is subject to error. You can override the guess by setting a "fixed-ppi" value in the item's metadata; that is how Jeff helped some of your items. For example, with item kalihwisaks_2012, you can see in the derive log (https://catalogd.archive.org/log/688526208) that it decided to use a ppi of 600, then consistently timed out while working on p. 117:

timeout 180 pdftoppm -f 117 -l 117 -r 600 -cropbox '/var/tmp/autoclean/derive/kalihwisaks_2012/kalihwisaks_2012.pdf' > /f/_kalihwisaks_2012/tmp.ppm failed with exit code: 124
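If you want to spot this kind of failure in a derive log yourself, here is a rough Python sketch that pulls the page number, ppi, and exit code out of a line like the one above. The pattern is modeled on that single example, so treat it as an assumption rather than a description of the log format; the function name is made up for illustration. Exit code 124 is what the timeout command returns when the time limit is hit.

```python
import re

# Pattern modeled on the one failure line quoted above (an assumption:
# other derive logs may be laid out differently).
FAILURE_RE = re.compile(
    r"timeout\s+(?P<limit>\d+)\s+pdftoppm\s+-f\s+(?P<page>\d+)\s+-l\s+\d+\s+"
    r"-r\s+(?P<ppi>\d+).*failed with exit code:\s*(?P<code>\d+)"
)


def parse_failure(line):
    """Return details of a failed pdftoppm call, or None if the line does not match."""
    match = FAILURE_RE.search(line)
    if match is None:
        return None
    return {
        "page": int(match.group("page")),
        "ppi": int(match.group("ppi")),
        "limit_seconds": int(match.group("limit")),
        # The timeout command exits with 124 when the limit was exceeded.
        "timed_out": int(match.group("code")) == 124,
    }


line = (
    "timeout 180 pdftoppm -f 117 -l 117 -r 600 -cropbox "
    "'/var/tmp/autoclean/derive/kalihwisaks_2012/kalihwisaks_2012.pdf' "
    "> /f/_kalihwisaks_2012/tmp.ppm failed with exit code: 124"
)
info = parse_failure(line)
print(info)
```

A timed-out page at a high ppi is the signal that a smaller fixed-ppi is worth trying.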

Jeff then set fixed-ppi to 200 (in task https://catalogd.archive.org/log/702495554), and the next rerun of the derive succeeded. When derives of PDFs are failing in the portion of the derive that makes jp2s from the PDF ("Module ProcessJP2"), you can try setting fixed-ppi yourself, to some value smaller than the one the derive is using. The smaller the value, the smaller (in pixels) the jp2s that we make, and the fewer resources required to make them.
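To get a feel for why a smaller ppi helps so much, here is a small arithmetic sketch. The 11 x 17 inch page size is a made-up example (not taken from any of the items above), and the function name is just for illustration; the point is only that pixel count grows with the square of the ppi.

```python
def page_pixels(width_in, height_in, ppi):
    """Pixel width, height, and total pixel count of a page rendered at a given ppi."""
    width_px = round(width_in * ppi)
    height_px = round(height_in * ppi)
    return width_px, height_px, width_px * height_px


# Hypothetical tabloid-size page, 11 x 17 inches.
for ppi in (600, 300, 200):
    w, h, total = page_pixels(11, 17, ppi)
    print(f"{ppi} ppi: {w} x {h} px = {total / 1e6:.1f} megapixels")
```

For this assumed page, dropping from 600 to 200 ppi cuts the pixel count by a factor of 9, since both dimensions shrink by a factor of 3.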

Once we're past that stage of the derive, sometimes the third-party software we use for OCR fails on specific pages, as happened with item el_zine_5. There's a mechanism for telling our code to bitonalize (convert to strict black-and-white) specific pages before trying OCR, which reduces the resources required to do OCR and increases the chances of success. There's also a mechanism for telling it to skip specific pages altogether. More convenient than using those mechanisms directly is to set "adaptive_ocr" to "true" in the item's metadata; the derive will then catch any OCR failure and automatically retry the page in black-and-white, and if that fails too, skip the page.
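The fallback order described above can be sketched roughly like this in Python. This is not our actual derive code, just an illustration of the logic; ocr_page and bitonalize are hypothetical callables, and the fake versions below exist only so the example runs.

```python
def adaptive_ocr(pages, ocr_page, bitonalize):
    """Try OCR on each page; on failure retry in black-and-white, then skip.

    pages is a list of (page_number, image) pairs. ocr_page and bitonalize
    are hypothetical stand-ins for the real OCR and conversion steps.
    """
    results = {}
    skipped = []
    for number, image in pages:
        try:
            results[number] = ocr_page(image)
            continue
        except Exception:
            pass  # fall through to the black-and-white retry
        try:
            results[number] = ocr_page(bitonalize(image))
        except Exception:
            skipped.append(number)  # last resort: leave the page out
    return results, skipped


# Toy stand-ins so the sketch can be run as is.
def fake_ocr(image):
    if image in ("hard-colour", "impossible", "bw:impossible"):
        raise RuntimeError("OCR failed")
    return f"text of {image}"


def fake_bitonalize(image):
    return "bw:" + image


pages = [(1, "easy"), (2, "hard-colour"), (3, "impossible")]
results, skipped = adaptive_ocr(pages, fake_ocr, fake_bitonalize)
print(results)
print(skipped)
```

In this toy run, page 1 succeeds directly, page 2 succeeds only after the black-and-white retry, and page 3 ends up skipped.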

I've set that option for el_zine_5 and rerun the derive; we'll see how it does this time.
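For anyone who wants to set these values from a script, here is a minimal Python sketch. It assumes the third-party internetarchive package is installed and configured with your account credentials, and that your account is allowed to edit the item; the two helper names are made up for this example. It only changes the metadata; rerunning the derive is a separate step.

```python
def build_metadata_patch(fixed_ppi=None, adaptive_ocr=False):
    """Build the metadata changes discussed above as a plain dict."""
    patch = {}
    if fixed_ppi is not None:
        if fixed_ppi <= 0:
            raise ValueError("fixed_ppi must be a positive number")
        patch["fixed-ppi"] = str(fixed_ppi)
    if adaptive_ocr:
        patch["adaptive_ocr"] = "true"
    return patch


def apply_patch(identifier, patch):
    """Write the patch to an item (needs network access and credentials)."""
    # Imported here so the rest of the sketch runs without the package installed.
    from internetarchive import modify_metadata

    return modify_metadata(identifier, metadata=patch)


patch = build_metadata_patch(fixed_ppi=200, adaptive_ocr=True)
print(patch)
# To apply it to one of your own items:
# apply_patch("your_item_identifier", patch)
```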