
Poster: tracey pooh Date: Jul 18, 2017 9:13am

Forum: forums Subject: Re: 'derive.php' task repeatedly reverts to 'waiting for admin'

well, sounds like rerunning isn't going to work.
can you paste a link or two to such items?
could be something up w/ the files or setup -- or something wrong or that we could improve w/ our processing...
thx!


Poster: TomQ Date: Jul 18, 2017 1:29pm

Forum: forums Subject: Re: 'derive.php' task repeatedly reverts to 'waiting for admin'

https://archive.org/details/el_zine_5
https://archive.org/details/san_quentin_news_2008
https://archive.org/details/hownikan_28.02
https://archive.org/details/kalihwisaks_2012
https://archive.org/details/motocross_performance_15.1
https://archive.org/details/2013.06

This is a variety of examples with different challenges. In most cases, there are plenty of sibling issues that had no trouble at all.


Poster: tracey pooh Date: Jul 18, 2017 5:03pm

Forum: forums Subject: Re: 'derive.php' task repeatedly reverts to 'waiting for admin'

thanks, helpful!
i'm more in the audio/video/TV end so less likely to help (unless something more _general_ going on w/ our processing).
most of these examples seem texts-ish so i've let our books processing lead engineer know.

sorry for the issues!


Poster: hank_b Date: Jul 18, 2017 6:32pm

Forum: forums Subject: Re: 'derive.php' task repeatedly reverts to 'waiting for admin'

It seems about half of those items have now derived successfully, due to intervention yesterday by Jeff.

Our code for deriving from PDF source files (i.e., making a stack of individual jp2 page images from the multi-page PDF) is brittle and frequently encounters PDFs that it can't handle. So a general suggestion: if you're able to upload text items as zips or tars of individual page images, there's a greater chance of a successful derive. See this old blog post for more info on uploading in those formats:

https://blog.archive.org/2012/05/24/uploading-images-for-text-items/
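If it helps, here is a minimal sketch of bundling a folder of page images into a zip before uploading. It is only an illustration: the function name, the list of image suffixes, and the example file names are my own assumptions, and the blog post above is the place to check which image formats and archive names the derive actually expects.

```python
import zipfile
from pathlib import Path

# Assumption: which suffixes count as page images; adjust to match the blog post.
IMAGE_SUFFIXES = {".jp2", ".jpg", ".jpeg", ".tif", ".tiff"}


def zip_page_images(image_dir, zip_path):
    """Bundle the page images in image_dir into zip_path, in filename order.

    Returns the list of file names written, so the page order can be checked.
    """
    pages = sorted(
        p for p in Path(image_dir).iterdir()
        if p.is_file() and p.suffix.lower() in IMAGE_SUFFIXES
    )
    # The images are already compressed, so store them without recompressing.
    with zipfile.ZipFile(zip_path, "w", compression=zipfile.ZIP_STORED) as zf:
        for page in pages:
            zf.write(page, arcname=page.name)
    return [p.name for p in pages]
```

Page order here comes from sorting the file names, so zero-padded names (page_0001.jpg, page_0002.jpg, ...) keep the pages in sequence. A call might look like zip_page_images("scans/issue_12", "issue_12_images.zip"), where both names are hypothetical.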

If you do need to upload PDFs, the most common reason for failure is the code trying to make too large a jp2 from one of the PDF pages, which, in turn, happens because we chose too high a ppi to use in the conversion. We rely on some rough heuristics to determine what ppi to use, and that choice is subject to error. You can override the guess by setting a "fixed-ppi" value in the item's metadata; that is how Jeff helped some of your items. For example, with item kalihwisaks_2012, you can see in the derive log (https://catalogd.archive.org/log/688526208) that the derive decided to use a ppi of 600, then consistently timed out while working on p. 117 (exit code 124 is what the timeout command returns when its time limit, here 180 seconds, expires):

timeout 180 pdftoppm -f 117 -l 117 -r 600 -cropbox '/var/tmp/autoclean/derive/kalihwisaks_2012/kalihwisaks_2012.pdf' > /f/_kalihwisaks_2012/tmp.ppm failed with exit code: 124

Jeff then set fixed-ppi to 200 (in task https://catalogd.archive.org/log/702495554), and the next rerun of the derive succeeded. When derives of PDFs are failing in the portion of the derive that makes jp2s from the PDF ("Module ProcessJP2"), you can try setting fixed-ppi yourself, to some value smaller than the one the derive is using. The smaller the value, the smaller (in pixels) the jp2s that we make, and the fewer resources required to make them.
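To give a rough feel for why a lower ppi helps, here is a small back-of-the-envelope sketch. The 8.5 x 11 inch page size is purely an assumption for illustration (I haven't checked the actual dimensions of p. 117), and page_pixels is a made-up helper, not part of the derive code.

```python
def page_pixels(width_in, height_in, ppi):
    """Return (width_px, height_px, total_px) for a page rendered at ppi."""
    width_px = round(width_in * ppi)
    height_px = round(height_in * ppi)
    return width_px, height_px, width_px * height_px


# Assumed page size: US Letter, 8.5 x 11 inches.
for ppi in (600, 200):
    w, h, total = page_pixels(8.5, 11, ppi)
    print(f"{ppi} ppi: {w} x {h} = {total / 1e6:.1f} megapixels")
# 600 ppi: 5100 x 6600 = 33.7 megapixels
# 200 ppi: 1700 x 2200 = 3.7 megapixels
```

Pixel count grows with the square of the ppi, so going from 600 to 200 cuts the number of pixels per page by a factor of 9.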

Once we're past that stage of the derive, sometimes the third-party software we use for OCR fails on specific pages, as happened with item el_zine_5. There's a mechanism for telling our code to bitonalize (convert to strict black-and-white) specific pages before trying OCR, which reduces the resources required to do OCR and increases the chances of success, and there's also a mechanism for telling it to resort to skipping specific pages altogether. Rather than using those mechanisms directly, it's more convenient to set "adaptive_ocr" to "true" in the item's metadata; the derive will then catch any OCR failures, automatically retry the page in black-and-white, and, if that fails too, skip the page.

I've set that option for el_zine_5 and rerun the derive; we'll see how it does this time.
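For anyone curious, the fallback order that adaptive_ocr follows can be pictured roughly like this. This is only a sketch of the control flow described above, not our actual derive code: ocr and bitonalize are hypothetical callables standing in for the third-party OCR engine and the black-and-white conversion, and the fake versions below exist just to exercise the logic.

```python
def ocr_with_fallback(pages, ocr, bitonalize):
    """OCR each (page_number, image) pair with a two-step fallback.

    First try the image as-is; on failure retry a bitonalized copy;
    if that fails too, skip the page. Returns (results, skipped).
    """
    results, skipped = {}, []
    for number, image in pages:
        try:
            results[number] = ocr(image)
            continue
        except Exception:
            pass  # fall through to the black-and-white retry
        try:
            results[number] = ocr(bitonalize(image))
        except Exception:
            skipped.append(number)
    return results, skipped


# Hypothetical stand-ins: "images" are plain strings here.
def fake_ocr(image):
    if "corrupt" in image or image == "gray:dense":
        raise RuntimeError(f"OCR failed on {image}")
    return f"text of {image}"


def fake_bitonalize(image):
    return image.replace("gray:", "bw:", 1)


pages = [(1, "gray:plain"), (2, "gray:dense"), (3, "gray:corrupt")]
results, skipped = ocr_with_fallback(pages, fake_ocr, fake_bitonalize)
print(results)  # {1: 'text of gray:plain', 2: 'text of bw:dense'}
print(skipped)  # [3]
```

Page 1 succeeds on the first try, page 2 only after the black-and-white retry, and page 3 fails both ways and is skipped.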
