now that we've exchanged pleasantries

bowerbird's picture

it was just one week ago that i made my first post here.

yeah, it seems like much longer than that to me too...
i guess, like they say, time flies when you're having fun.

anyway, i'd like to thank all the people here who have
welcomed me, and made me feel at home. sincerely...

anyway, now that we have exchanged our pleasantries,
it's time to get to work...

i've programmed an app that takes the text of a book
as its input, and outputs a .pdf version and .html files.

one .html version includes the whole book in one file.

another .html version splits the book up into chapters,
creates a web-page for each chapter, and links them...

the third .html version splits the book up into _pages_,
and creates a web-page for each _page_ of the book...
this version puts an image of the page next to its text.
(it was originally developed to proof o.c.r. output.)

this granularity-level allows people to make comments
about specific aspects of the book, so each of the pages
has a form to collect the comments, and error-reports.
(that form is extremely primitive at this point; so be it.)

the main idea of this page-by-page web-site version
is that authors should corral and focus their readers,
in order to establish a community around their work,
cumulating small gifts from fans to make their living.

in this vein, the .pdf version has a link, on every page,
that takes the reader to the web-page for that page...
thus the .pdf and web-version are tightly interleaved.

now, the web-version has lousy typography, of course,
because it lives in the browser. not much hope there...

but the .pdf version? well, there's some hope there.
and that's good, since the .pdf version will be what'll
be used when fans take the print-on-demand route,
via lulu.com, createspace, or a slew of future options.

but my main goal is to make the .pdf readable as is,
as an e-book, on-screen, and for that reason, i am
eager to apply the lessons learned by typographers
-- or, more aptly, _book_designers_ -- in the past.

at the same time, i must live within the limits that
are imposed by relatively scarce resources on the
hand-held machines -- the iphone, kindle, etc. --
that currently make up the e-book infrastructure...

in the large realm, the one thing that challenges
resources to their breaking point is hyphenation,
especially since it must be real-time, on-the-fly.

we don't have enough firepower to do hyphenation,
not correctly anyway, and doing it badly is no option.

but if we can't do hyphenation, the next question is
whether we can expect to do justice to justification.
loose lines are bad in print; they're awful on-screen.

it would be very sad to give up on justification though,
because -- perhaps more than any other variable --
justification is what makes a book "look like a book"...

and to the extent that e-books can "look like a book",
we will ease their introduction to the public at large...
(it is for this reason that the kindle currently justifies,
even though the white-space rivers look like _crap_.)

so i'm doing experiments on how to keep justification
-- even after we've kicked hyphenation to the curb --
and i would like feedback from book-designers here.

if my experiments fail, i want you to say that, loudly...
i don't have anything invested in any of these tactics,
i'm simply trying to find out if any of them will _work_.

if they don't work, you won't hurt my feelings to say it.

(and even if it _would_ "hurt my feelings", you should
say it anyway, because truth compels you to be honest.)

in order not to bias my experiment, i used actual text
from an actual book (one recently released by o'reilly).

to avoid copyright concerns, however, i scrambled
the text. i randomly replaced every lowercase vowel
with a different lowercase vowel, with the effect that
the general tenor of the words and lines was retained.
(although it probably raised havoc with the kerning.)
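A minimal sketch of the vowel scrambling described above. The function name and the fixed seed are assumptions made for illustration; the post does not say how the replacement was actually coded.

```python
import random

VOWELS = "aeiou"

def scramble_vowels(text, rng=None):
    """Replace every lowercase vowel with a different lowercase vowel."""
    rng = rng or random.Random()
    out = []
    for ch in text:
        if ch in VOWELS:
            # pick any vowel except the one being replaced
            out.append(rng.choice([v for v in VOWELS if v != ch]))
        else:
            out.append(ch)  # consonants, capitals, spaces, punctuation stay put
    return "".join(out)

sample = "The quick brown fox jumps over the lazy dog."
print(scramble_vowels(sample, random.Random(0)))
```

Word lengths and word boundaries are untouched, so line breaks fall roughly where they would in the real text, although the set widths shift a little because the vowels differ in width.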

i then used my tool to create .pdf and .html versions.

the text was unhyphenated, but i did justification on it.

so, the first question i have for you is a simple one:

> "of the 144 pages within this .pdf, how many have
> an unacceptably high number of too-loose lines?"

ok, maybe that question isn't _quite_ so simple, with
all of its inherent vagueness -- what constitutes a
"too-loose" line?, and what's a "high" number of 'em?,
let alone an "unacceptably" high number?, and what
about the pages that don't have any justified lines? --
but you get the general idea of what i'm asking here...

in general, did the tactics that i used make it work?

so, there are two ways you can look at the pages...

you can just download the .pdf:
> http://z-m-l.com/go/bomoc/bomoc.pdf

or you can use the online page-by-page version:
> http://z-m-l.com/go/bomoc/bomocp001.html

in the web-version, clicking each page takes you to
the next page. or you can jump to a specific page
by changing the zero-padded number in the u.r.l.

for instance, here's the u.r.l. for page 123:
> http://z-m-l.com/go/bomoc/bomocp123.html
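The zero-padded naming can be sketched as a small helper. The function name is hypothetical; the u.r.l. pattern and the 144-page count come from the post.

```python
BASE = "http://z-m-l.com/go/bomoc/bomocp{:03d}.html"
PAGE_COUNT = 144  # number of pages in the .pdf, per the post

def page_url(page):
    """Build the u.r.l. of one page of the page-by-page version."""
    if not 1 <= page <= PAGE_COUNT:
        raise ValueError("page must be between 1 and %d" % PAGE_COUNT)
    return BASE.format(page)  # {:03d} pads the number to three digits

print(page_url(7))
print(page_url(123))
```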

if you want to talk about a specific page, tell us the
page-number, so we can look at that page online...

there are lots of things that are downright _ugly_
in that .pdf. it's just the starting point from which
i will be demonstrating several improved iterations.

so there is no need to belabor all the bad aspects...
(especially the bottom-balancing should be ignored.)

right now, i'm looking for feedback on loose lines.

oh yeah, what i did is i used a multi-line approach.
not the sophisticated ones used by tex or indesign
-- again, limited resources, and all in real-time --
just a pretty simple one. nothing novel about that...
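For readers unfamiliar with the term, here is one simple kind of multi-line approach, sketched under loud assumptions: it is not the routine used for the .pdf (which the post does not show), widths are counted in characters as if the font were monospaced, and the cost function (squared slack per line, last line free) is just a common textbook choice.

```python
def break_paragraph(words, measure):
    """Choose breaks that minimise total squared slack over the whole
    paragraph, rather than filling each line greedily."""
    n = len(words)
    best = [float("inf")] * (n + 1)  # best[i]: least cost of setting words[i:]
    nxt = [n] * (n + 1)              # nxt[i]: where the line starting at i ends
    best[n] = 0.0
    for i in range(n - 1, -1, -1):
        width = -1  # cancels the space counted before the first word
        for j in range(i + 1, n + 1):
            width += 1 + len(words[j - 1])
            if width > measure and j > i + 1:
                break  # line is full; a lone overlong word is still allowed
            slack = max(measure - width, 0)
            cost = 0 if j == n else slack ** 2  # last line may run short
            if cost + best[j] < best[i]:
                best[i] = cost + best[j]
                nxt[i] = j
    lines, i = [], 0
    while i < n:
        lines.append(words[i:nxt[i]])
        i = nxt[i]
    return lines

for line in break_paragraph("aaa bb cc ddddd".split(), 6):
    print(" ".join(line))
```

A greedy fill of the same words would set "aaa bb" and then leave "cc" alone on a very loose line; looking at the whole paragraph moves "bb" down so the slack is shared out.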

i did, however, throw in a couple of other little tricks.

i believe they should be pretty obvious to you experts,
but i'll let you tell me what they are, and not vice versa.

and then, of course, i am curious if you think that they
are tricks that are "legit" or not, and why you think that.

-bowerbird

joeclark's picture

Your HTML semantics are extremely poor; BR P is not how we mark up a paragraph, even in HTML 4.

It’s good your PDF has bookmarks, but any claimed use of a PDF for E-books requires tagged PDF, and you have to use semantics there that are even better than HTML’s.

You haven’t actually even declared a human language in either case.

Your approach is OK for an early beta, but a great deal more finesse is required, and I’m not just talking about fonts.


Joe Clark
http://joeclark.org/

abattis's picture

> now, the web-version has lousy typography, of course, because it lives in the browser. not much hope there...

Well, some hope :)

http://www.alistapart.com/articles/cssatten

http://code.google.com/p/hyphenator/

http://userscripts.org/scripts/show/9022

Typedog's picture

Wow this person posted a book!
Too long, I'm leaving.

Guerrizmo+Design
No man is an island unto himself_John Donne

bowerbird's picture

joe said:
> Your HTML semantics are extremely poor;

i agree. wholeheartedly. my focus was on the .pdf.

my .html converter wasn't prepared for many of the features
in this book, like numbered lists and outdented paragraphs.

(neither of them shows up much in the public-domain fiction
that constituted the bulk of the project gutenberg e-texts that
served as the starter-corpus for this project, believe it or not.)

> BR P is not how we mark up a paragraph, even in HTML 4.

there's a toggle in my tool that lets users retain the linebreaks
when they create the .pdf, or do the typical wrap-to-viewport.

how do you suggest that i retain linebreaks, if not with "br"?

> It’s good your PDF has bookmarks, but any claimed
> use of a PDF for E-books requires tagged PDF, and you
> have to use semantics there that are even better than HTML’s.

there's a very deep discussion that we could have on this matter.

but for now, let me just skip over that with a quick "you're right,
and i would've loved to make this a tagged .pdf, but the package
that i'm using -- pdflib lite, v5 -- is old, and doesn't support it."

it also limits me to the 14 embedded fonts, hence the helvetica...
and finally, i can't web-optimize the .pdf, which i'd like to do too.

if my eventual users wanna chip in for the new commercial version
of the pdflib package, or another package, i'll be happy to have it...
in the meantime, i content myself with what i can use at no charge.
(and i thank the pdflib people for their generosity in offering that.)

> You haven’t actually even declared a human language in either case.

as you certainly know, joe, i'm just using a template for the .html.

and i'll be the first person to tell you that i don't know much about
.html, and don't care to learn any more than i absolutely have to,
so improvements on that template will have to be made by others...

so if you'd like to create the kind of template you think i should use,
joe, i will be quite happy to take it and use it and give you full credit.

if not, i'll wait for one of my eventual users to volunteer for that job.

> Your approach is OK for an early beta, but a great deal
> more finesse is required, and I’m not just talking about fonts.

i wasn't really showing my "finesse" yet, to the extent that i have any.

i'm just seeing if i can get sufficiently tight lines without hyphenation.

do you have any feedback for me on that? :+)

-bowerbird

bowerbird's picture

abattis said:
> Well, some hope :)

great. i love hope. indeed, i have had a poster in my
front window that says "hope" for many months now.

(it's got a picture of our first black president on it too,
and hey, if _that_ could happen, then maybe browsers
will be able to give us good book typography as well!)

so let's take a look...

> http://www.alistapart.com/articles/cssatten

ok, css? seriously? i've only been here a week so far,
and there's already blood splattered all over the place,
so i think i will wait to tell you what i think about css...

in the meantime, go to the zen garden and pull up
any 20 designs at random. click your fontsize up
a few notches and see what percentage of those 20
fancy designs crumbles into unappealing messes...

or resize your window and see how many _really_
"readjust and reflow", versus how many do not...

> http://code.google.com/p/hyphenator/

very cool.

let me know when 60% of the population surfs with
a browser that supports this thing, and i'll be happy.

of course, you or i might grow old and _die_ first...

besides, the future of e-books ain't in the browser.
that's patently obvious. it's in handheld machines.
like the once-again-rumored mac 10-inch touch...

(and yes, there will be browsers on the handhelds,
and you'll be able to read e-books at websites, but
client-side non-browser apps don't have to run in
a blind sandbox, so will always be more powerful;
plus they can request data from the web if needed.
there's just no compelling need to use a browser.)

> http://userscripts.org/scripts/show/9022

this one i already knew about. haven't looked at it
very much. looks interesting, to be sure, but still,
re-cue all of the arguments that i have just made...

-bowerbird

Nick Cooke's picture

Bowerbird - You want people to bother reading this post? Try using initial caps and a longer line length and not so many line spaces. The appearance of this is extremely irritating.

BTW - I haven't bothered reading your initial post.

Nick Cooke

Nick Shinn's picture

...cumulating small gifts from fans to make their living.

aluminum's picture

If you are pooh-poohing CSS, why are you bothering making a web site?

bowerbird's picture

a new nick said:
> The appearance of this is extremely irritating.

the old nick said:
> ...cumulating small gifts from fans to make their living.

aluminum said:
> If you are pooh-poohing CSS,
> why are you bothering making a web site?

hey kids, if you want to bash, you'll have to do that in my
earlier thread -- "greetings, and a request for feedback".

this thread is for _working_, so only on-topic comments.

-bowerbird

Don McCahill's picture

> the one thing that challenges
> resources to their breaking point is hyphenation,
> especially since it must be real-time, on-the-fly.

Have you looked into Donald Knuth's algorithm for TeX hyphenation? It is published in Computers and Typesetting, Volume A, and provides pretty good hyphenation that worked on machines 20 years ago, when resources were much more limited than they are today.

Without hyphenation you are going to have a lot of trouble getting good readability.

aluminum's picture

"so only on-topic comments"

Are you NOT asking about the web site you are building?

How is CSS not on-topic?

And, as stated, don't expect very good feedback when your posts are nearly undecipherable. This isn't bashing. It's constructive feedback. Take it or leave it.

Nick Shinn's picture

the old nick said:
> ...cumulating small gifts from fans to make their living.

No, that's what you said.
The picture of an impoverished artist living off discretionary payment for her freely given work is my on-topic comment.
Deal with it, or ignore it.
But please try to refrain from snide patronizing ad hominem remarks.

BTW, people at this site have expressed dismay at the posts of David Berlow and Peter Enneson, to name but two frequent contributors with high style, but it is a difficulty with their writing, not their layout.

speter's picture

Why
Nick
whatever
do
you
mean?

Layout
is
one
of
bb's
strong
points.

bowerbird's picture

don said:
> Have you looked into Donald Knuth’s
> algorithm for TeX hyphenation.

yes.

> It is published in Computers and Typesetting, Volume A

it's pretty hairy, isn't it? :+)

> and provides pretty good hyphenation that
> worked on machines 20 years ago, when
> resources were much more limited than they are today.

when i speak of "limited resources", i mean in devices like
the iphone, the kindle, and sony reader, not the desktop...

moreover, with such machines there's a need
to execute the routine on-the-fly, in real-time, and to be
impervious to those situations which "cannot" be resolved.

tex, in contrast, is a batch program, so you're judging it
in a different context that really has little relevance here.
even if it's typically fast, there are some occasions where
it can take a while. and sometimes it will just give up and
say "this paragraph just _won't_ work as is, so rewrite it."
we simply don't have that luxury in a handheld situation.

now, it _is_ possible to have the routine do just one page,
or a few, a tactic used by some apps... (ok, one, stanza.)

but there are two problems with this approach. the first
is that the routine interrupts the reader's attentional flow
when there's a delay, however short, every page or three.

the second is that it makes the pagination indeterminate.
so, to jump to the middle of a book -- e.g., page 123 --
you need to run the routine on the first 122 pages first;
otherwise, you don't know what text falls on page 123...
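The dependency described above can be seen in a toy paginator. This is only a sketch: it leans on the standard-library `textwrap` module for line breaking and counts widths in characters, which is an assumption and not how any of the viewer-programs mentioned here work.

```python
import textwrap

def paginate(paragraphs, measure, lines_per_page):
    """Wrap every paragraph, then cut the resulting lines into pages."""
    lines = []
    for para in paragraphs:
        lines.extend(textwrap.wrap(para, width=measure))
    return [lines[i:i + lines_per_page]
            for i in range(0, len(lines), lines_per_page)]

def page(paragraphs, number, measure, lines_per_page):
    """Text of one page. Everything before it has to be wrapped first,
    because each earlier line break shifts what lands on this page."""
    return paginate(paragraphs, measure, lines_per_page)[number - 1]

book = ["aaa bbb ccc ddd", "eee fff"]
print(page(book, 2, measure=7, lines_per_page=2))
print(page(book, 2, measure=3, lines_per_page=2))
```

Changing the measure (or the font size, or the break routine) changes what "page 2" contains, which is why the page has to be either computed from the start of the book or guessed.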

the only way to avoid this is to "guess" where page 123 is.

while some viewer-programs do operate in this manner
-- such as mobipocket, and thus probably the kindle --
it's quite unsatisfactory, and leads to some weird shit...
for example, in mobipocket, paging forward to page 123
will sometimes give you _different_text_ on that "page"
than you'll get when you page _backward_ to page 123.
(and not just on page 123, but _all_ intervening pages.)

obviously, that _breaks_ our notion of what a "page" is,
in a fairly significant -- and wholly impractical -- way.

some day we will all be carrying around machinery that
would qualify as a "supercomputer" today. but not today.
(but the iphone is a "supercomputer" from 40 years ago.)

> Without hyphenation you are going to have
> a lot of trouble getting good readability.

well, that's actually the point in question here, right now.

specifically, we know that we can't do hyphenation right.

so the question is, "must we abandon justification too?"

i'd like to save justification -- as an option, at least --
so i'm doing some experimentation to find out if we can
use means other than hyphenation to avoid loose lines...

i made a .pdf -- with actual text from an actual book --
that uses justification without hyphenation, as one test...

so i'm looking for feedback from you experts here as to
whether you think that it looks "good enough" or not...

i'm also going to compute the increase in word-space,
because i think hard numbers are useful in addition to
the subjective "looks good to me" or "looks bad to me",
but, as with much typography, the ocular test is best...

***

aluminum said:
> Are you NOT asking about the web site you are building?

i am _not_ asking about the website. emphatically not...

right now, i'm asking if the justified lines in the .pdf show
sufficient tightness, or if there is an excess of loose ones.

i merely put up the website so that people who only want
to view the occasional-page-being-referenced can do so
-- like say i talked about page 123, you could just view:
> http://z-m-l.com/go/bomoc/bomocp123.html
-- without having to download the whole .pdf. (but the .pdf
is only 2.6 megs, so it's not like it's _that_ big of a burden.)

but the text on the website, and its underlying .html,
means next to nothing to me at this point in time, no.
i could just as easily have eliminated it entirely.

that having been said, though, i will eventually be _quite_
interested in using the best c.s.s. that experts can write
for the sites produced by my tool, so if you would like to
_produce_ that c.s.s., do please let me know, and i will
be quite happy to tell you what i have in mind, and then
let you go to work doing what you think should be done
-- after all, you're the expert! -- and when you're done,
i will be quite happy to evaluate it for use with my tools,
and will give full and proper credit for your contribution.

_that_ would be "constructive"...

***

one more thing...

i have regenerated the .pdf without the bottom-balancing,
because the carding caused too much variation in the color,
making it difficult to gauge the looseness of justified lines...

the new .pdf is now available at the old location:
> http://z-m-l.com/go/bomoc/bomoc.pdf

the old .pdf is renamed, if you want to look at it:
> http://z-m-l.com/go/bomoc/bomoc-1.pdf

i'll replace the page-scan images in the website as well,
since they are simply screenshots lifted out of the .pdf...

in the process, i might update the .html, for mr. clark...
(but no promises, joe.)

-bowerbird

kentlew's picture

> so i’m looking for feedback from you experts here as to
> whether you think that it looks “good enough” or not...

No.

> right now, i’m asking if the justified lines in the .pdf show
> sufficient tightness, or if there is an excess of loose ones.

Excess of loose ones.

> so the question is, “must we abandon justification too?”

Yes, probably. Since hyphenation seems to have been ruled out.

It seems to me that within the constraints outlined, you will have to choose between the short-term: lure folks to the e-book by making it seem more familiar -- i.e., justification -- at the risk of making it untenable for long-haul reading; and the long-term: eschew superficial familiarity in hopes of demonstrating that an e-book can be just as comfortable to read -- which in this case would seem to require FL/RR setting, so as not to cause such disruptive word spacing.

That's my professional opinion.

-- K.

bowerbird's picture

nick said:
> The picture of an impoverished artist living off discretionary
> payment for her freely given work is my on-topic comment.
> Deal with it, or ignore it.

that's off-topic in this thread.

i brought it up to explain _why_ there's a web-site version.
it has no other relevance to the topic of the ongoing thread,
where the question is still regarding loose lines in the .pdf...

to discuss that picture, do it elsewhere, such as in the old
"greetings" thread. and be prepared to receive an onslaught
of counterexamples. but don't do it here; we're working here.

and besides, why would you pick on that poor little old lady?
is that the kind of "man" you are, kicking around the feeble?

-bowerbird

aluminum's picture

"i am _not_ asking about the website. emphatically not..."

Ah, well, apologies, then!

I'll duck out of the conversation as I am obviously having a tough time traversing your posts and filtering out the specific question you might have.

bowerbird's picture

kentlew said:
> Yes, probably. Since hyphenation seems to have been ruled out.

ain't done experimenting yet. but i heartily appreciate your input. :+)

what i'll probably end up doing is computing the line-looseness,
for individual lines, and from that, pick out some pages that are
"in the middle" in regard to their overall looseness, and then ask
people like you, if you would be so kind, to judge those pages,
so i get the feel for how much looseness constitutes "too much".
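One way such a per-line looseness figure could be computed, as a rough sketch. The half-em-per-character width model is a crude stand-in for real font metrics, and every name here is hypothetical; a real tool would read the glyph widths from the font.

```python
def justified_space(words, measure, width_of):
    """Word space (in ems) needed to stretch `words` to exactly `measure` ems."""
    gaps = len(words) - 1
    if gaps < 1:
        return None  # a one-word line cannot be justified by word-spacing
    text_width = sum(width_of(w) for w in words)
    return (measure - text_width) / gaps

def crude_width(word, em_per_char=0.5):
    """Assumed metric: every character is half an em wide."""
    return len(word) * em_per_char

def loose_lines(lines, measure, max_space=0.5, width_of=crude_width):
    """Return (line number, word space) for each line spaced wider than max_space."""
    flagged = []
    for number, line in enumerate(lines, start=1):
        space = justified_space(line.split(), measure, width_of)
        if space is not None and space > max_space:
            flagged.append((number, space))
    return flagged

print(loose_lines(["aaaa bbbb cc", "aa bb"], measure=6.0))
```

Counting the flagged lines per page gives a number that pages can be sorted by, so the "middle" pages can be pulled out for judging by eye.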

bringhurst does throw out a range -- from a quarter-em to a half
(page 26) -- but that's a relatively large difference, especially as he
goes on to detail a number of other variables that can be involved.

> It seems to me that within the constraints outlined,
> you will have to choose between the short-term:
> lure folks to the e-book by making it seem more familiar
> — i.e., justification — at the risk of making it untenable
> for long-haul reading; and the long-term:
> eschew superficial familiarity in hopes of demonstrating
> that an e-book can be just as comfortable to read —
> which in this case would seem to require FL/RR setting,
> so as not to cause such disruptive word spacing.

the options aren't quite _that_ stark, fortunately...

i haven't given much of the backstory, but these
conversions -- to .html and .pdf -- constitute an
extremely minor part within the scope of my work.

the more crucial aspects of it are my _file_format_
-- it's called "z.m.l.", for "zen markup language" --
and my _authoring_tool_ and _viewer-programs_...

in the long run, people will use my authoring-tool
to create books in z.m.l., and their readers will then
read those zml-books by using my viewer-program.

these .pdf and .html versions are for the short-term.

my viewer-program is more powerful than acrobat's,
and much more reader-friendly than a web-browser,
so the lifetime of .pdf and .html versions will be short.

in my viewer-program, a simple command-j toggles the
justification, so a reader can use justification if they like,
as their default, but quickly and easily switch it off when
they encounter any page which has too many loose lines.
and turn it back on again when they go to the next page.

so, for on-screen reading, with my app, it's a non-issue.

and once people recognize my viewer-app's superiority,
they'll stop reflexively reaching for the .pdf or the .html.
so these converters are just stop-gap until that happens.

plus, for _any_ zml-book they decide to print, using
my viewer's built-in conversion routine to create .pdf,
they can specify either justification or ragged-right...

(i could even let them justify on a page-by-page basis...
haven't ever coded it, but it's a thing that could be done.)

it's only because .pdf is a "frozen" format that this is even
something that i am spending any time on. but it's better
to give the user the option to choose whatever they want,
and easier, and more flexible, so that's what my app does.

which means this is _not_ a stark "either/or" situation...

(especially since i will offer a "half-ragged" option as well,
as i seem to remember that having been a possible choice
_sometime_ in my experience, even if i can't locate it now.
you increase the word-space for each line to _one-half_
of what it would be if you were doing _full_ justification.)
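The arithmetic of the "half-ragged" idea can be sketched as follows. This reads the description as "add half of the extra space that full justification would add", which is one interpretation of the sentence above; the function names are made up for illustration.

```python
def half_ragged_space(natural, justified, fraction=0.5):
    """Word space for a line set part-way between ragged and justified.

    natural   -- the font's normal word space, in ems
    justified -- the word space full justification would need, in ems
    fraction  -- 0.0 gives ragged-right, 1.0 gives full justification
    """
    return natural + fraction * (justified - natural)

def right_margin_slack(natural, justified, gaps, fraction=0.5):
    """How far short of the right margin the line ends, in ems."""
    space = half_ragged_space(natural, justified, fraction)
    return gaps * (justified - space)

print(half_ragged_space(0.25, 0.75))      # 0.5
print(right_margin_slack(0.25, 0.75, 4))  # 1.0
```

Each line's word space grows only half as much, at the price of a right edge that is still ragged, though by half as much as in plain ragged-right.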

> That’s my professional opinion

i appreciate it very much, and i thank you for your time...

-bowerbird

Typedog's picture

Wow, two weeks bowerbird you made it!

Guerrizmo+Design
No man is an island unto himself_John Donne

Theunis de Jong's picture

On the TeX hyphenation jabs:

The routine was developed in the early '80s. At that time, TeX ruled because no desktop machines were fast enough to run it in real time; only through batch processing could TeX deliver its pages.

Fast forward to now.

> It is published in Computers and Typesetting, Volume A

> it’s pretty hairy, isn’t it? :+)

Not if you're a half decent programmer. (grin)
For laffs, I just finished programming a Windows text viewer, incorporating proper justification (up to semi-pixel precision) and Liang hyphenation thrown in. Now I'm contemplating Knuth/Plass paragraph breaking, because it seems worth a try to check if text really breaks 'nicer'.

> sometimes it will just give up and say “this paragraph just _won’t_ work as is, so rewrite it.” we simply don’t have that luxury in a handheld situation

Granted -- but even desktop giant InDesign, with its advanced hyphenation (better'n Liang's! more hyphens, distinction between preferred breaking positions and less preferred ones) and not-quite-TeX-like line breaking [*], will not always produce high quality typesetting. If you only have one go, you'll have to anticipate every possible problem, such as words in other languages, mathematical formulas popping up inside running text, and the occasional this-would-look-better-if-broken-otherwise.

[*] They do use a derivation of the algorithm, but where TeX can leisurely fill up the entire RAM of a computer, contemplating a very long paragraph, ID stops evaluating a paragraph after so-many lines, breaks those into the best lines, then continues with the rest. So, while it is possible there are better line breaks possible, the routine produces a "best with limited time & memory". I presume next versions will get better, with more memory and faster CPUs.

Theunis de Jong's picture

For example, notice the un-breakable name "Übersetzenseehafenstadt", although I'm thinking the u-umlaut causes my routines not to break it. Without that, the stand-alone hyphenating test program suggests "Uber*set*zensee*hafen*stadt" -- so far "the Unexpected".

*Sigh* -- you just have to think of everything.

bowerbird's picture

theunis said:
> On the TeX hyphenation jabs:

they aren't "jabs".

tex does a very nice job of hyphenating.

it's just that you cannot make it work on
a handheld, with a real-time requirement.

and that's where e-books reside, now and
from now on, not on a desktop computer...

now, when i say "you", what i really mean
is "no one so far", which includes the team
(i presume) who made kindle's viewer-app,
and people who have made iphone e-books.
(unless you think stanza did it well; i don't.)

and "cannot" means "hasn't been done yet".

so i'm not closing out the possibility for it...
i'm just saying no one has obtained it _yet_.
we must understand and acknowledge that...

moreover, i don't think we need hyphenation.
multi-line break routines do well enough, for
a reasonable percentage of english paragraphs.
(for other languages, programmers who speak
the languages must figure out what they'll do.)

> Not if you’re a half decent programmer. (g

i'm only quarter-decent, so i won't even try. ;+)

but i will be happy to look at your work,
if and when you're ready to show people.

> I presume next versions will get better,
> with more memory and faster CPUs.

but as long as you're talking about _indesign_,
you're not talking about _handheld_ machines.

and desktop computers (even laptops, unless
we're talking about those wee little eee things)
are now fairly irrelevant in the land of e-books.
handhelds are all that matters, and it will be a
very long time before they match the resources
of a desktop computer today, let alone grow to
something with "more memory and faster cpus".

indeed, since massive-run p-books will soon
be a thing of the past, except for best-sellers,
the entire workflow of one-time-layout (on the
publisher's high-end desktop computers with
indesign or quark) will morph to each individual
specifying her own preferences for a pod-book,
with layout proceeding from her personal choices.

-bowerbird

Theunis de Jong's picture

> but as long as you’re talking about _indesign_,
> you’re not talking about _handheld_ machines.

Correct. InDesign is an interactive DTP application, where handheld machines are barely interactive -- Page Forward, Page Back. Maybe Search as well.

I'm guessing you want hardware specs to be as low as possible, so the prudent thing is to 'prepare' text for handhelds by pre-hyphenating text. That can be done using any old system, running a hyphenator as slow as you want. But that's already been suggested, and

> let me know when 60% of the population surfs with
> a browser that supports this thing, and i’ll be happy.

... as a matter of fact, my first complete Knuth/Liang hyphenation program did exactly that in 2004. It did what the Google Code hyphenator does: insert all possible hyphens in an HTML file. I proposed it to the firm I was working for at the mo', to be used in an HTML reference guide they were working on. It was turned down because it apparently did not "work with ALL browsers", which was kind of a bummer.
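A minimal sketch of pre-hyphenating text with soft hyphens (U+00AD). The break table below is a tiny hand-made stand-in, not Liang's patterns and not the program described above, and the function works on plain text only; running it over raw HTML would also touch tag names.

```python
import re

SOFT_HYPHEN = "\u00ad"

# Hypothetical break table; a real tool would derive breaks from patterns.
BREAKS = {
    "hyphenation": "hy-phen-ation",
    "justification": "jus-ti-fi-ca-tion",
}

def pre_hyphenate(text, breaks=BREAKS):
    """Insert a soft hyphen at every known break point of every known word."""
    def mark(match):
        word = match.group(0)
        pattern = breaks.get(word.lower())
        if pattern is None:
            return word  # unknown word: leave it alone
        out, i = [], 0
        for ch in pattern:
            if ch == "-":
                out.append(SOFT_HYPHEN)
            else:
                out.append(word[i])  # keep the original capitalisation
                i += 1
        return "".join(out)
    return re.sub(r"[A-Za-z]+", mark, text)

marked = pre_hyphenate("Justification without hyphenation.")
print(marked.replace(SOFT_HYPHEN, "|"))  # show the break points visibly
```

The soft hyphens stay invisible unless the renderer breaks a line at one of them, so the slow work is done once, ahead of time, and the reading device only has to choose among the marked positions.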

So. No hyphenation means you cannot really expect good looking justified text -- as has been remarked upon before. Case closed.

> it will be a very long time before they match the resources of a desktop computer today, let alone grow to something with “more memory and faster cpus”.

Sitting here with a g*d*mn phone surpassing memory and CPU caps of my first computer by a factor I'm too weary to calculate, I tend to disagree.

> the entire workflow of one-time-layout [...] will morph to each individual specifying her own preferences for a pod-book, with layout proceeding from her personal choices.

Absolutely agree on that. But I must repeat, there is no way (barring futuristic AI applied to typesetting) any random document can be typeset (oh alright, "formatted") in just about any pleasing way the customer wants. There must be some form of pre-formatting, or tagging, or whatever. Adding hyphens to the mix is not difficult.

Are you targeting currently existing hardware, or aiming for the future? If the former, you have no choice but to opt for the most common divider -- "it displays text", and that's all.

bowerbird's picture

theunis said:
> I’m guessing you want hardware specs to be as low as possible

no. i want a supercomputer in my hand. in everyone's hands.
and i want it to be cheap enough that everyone will have one...

we're just not there yet. and it'll be a while before we are there.

> the prudent thing is to ’prepare’ text
> for handhelds by pre-hyphenating

even pre-hyphenated text won't work. based on what i've observed.

i haven't yet started coding for the iphone, because it requires xcode,
and i'm not yet ready to start learning a new language, so i cannot be
as specific as i'd like to be in telling you exactly what today's machines
can and cannot do, but it's clear that the resources are relatively sparse.
when i've done some coding, i'll know how much we can push the edge.

besides, it's not insertion of all-possible-hyphens that's the problem.
current machines can do that just fine, no reason to do it beforehand.
the problem is that you can't jump to page 123 without processing the
text _before_ page 123, to know exactly what text is _on_ page 123...
we can't do _that_ fast enough without making people wait too long.
people don't like to wait these days...

> It was turned down because it apparently did not “work with
> ALL browsers”, which was kind of a bummer.

soft hyphens _still_ don't work with all browsers.

and the big reason for that is the browser-programmers _know_
hyphenation is a time-sink that will make users unhappy waiters.

> No hyphenation means you cannot really expect good looking
> justified text — as has been remarked upon before. Case closed.

you can think the "case" is "closed". fine. i'm quite sure that it's not.

> Sitting here with a g*d*mn phone surpassing memory
> and CPU caps of my first computer by a factor I’m too weary
> to calculate, I tend to disagree.

ok. prove it by making a hyphenation routine work on your phone
fast enough to process an entire book in real-time on-the-fly...

the world will thank you. well, ok, _typographers_ will thank you.
(everyone else will yawn and turn back to "dancing with celebrities".)

you and i have no beef. do you get that? you're arguing with _reality_.
and, in my experience, that's a waste of time, since reality always wins.
and if you _insist_ on arguing, she will slap you in the face big-time...

> I must repeat, there is no way (barring futuristic AI applied
> to typesetting) any random document can be typeset (oh alright,
> “formatted”) in just about any pleasing way the customer wants.

sure there is, depending on how "pleasing" the person "wants" it to be.
if you have unlimited time. but we want e-books to work in real-time.
that's the thing that makes all this so challenging. have i been unclear?

> There must be some form of pre-formatting, or tagging, or whatever.

when you find "whatever", let people know. that's what i'll be doing...

> Are you targeting currently existing hardware,
> or aiming for the future?

both. the future _is_ now.

-bowerbird

Theunis de Jong's picture

> ok. prove it by making a hyphenation routine work on your phone fast enough to process an entire book in real-time on-the-fly...

My proposal was actually to do that beforehand. There is no reason to dismiss this idea whatsoever. That's what I meant with 'tagging' and 'preformatting' -- the more work is done in the file that goes into your hypothetical reader, the less it has to do on its own.

> when you find “whatever”, let people know. that’s what i’ll be doing...

Yep -- the points I mentioned earlier. Live hyphenation, for example, requires all texts tagged with the correct language. It's no different than tagging text to be italic, or bold, or as quotes. Surely you don't expect your (hypothetical?) program to handle that without supervision? So, take as much of the burden of "live" formatting away from your reading device and move it into the data file. As per your example, if page numbers need to be known in advance (if your document contains references to absolute pages, e.g., a table of contents, an index, or internal references), you have no choice but to "fix" page boundaries. If you don't you will have to devise a way to generate the page references when needed. It's really one or the other.

You'll have to define your reading target, though, as formatting of a novel is quite different than that of, say, a programmer's guide to Perl (to name a random scientific text).

> you and i have no beef. do you get that? you’re arguing with _reality_. and, in my experience, that’s a waste of time, since reality always wins. and if you _insist_ on arguing, she will slap you in the face big-time...

I'm merely trying to fathom what you are hoping to accomplish. No arguing involved.

> the future _is_ now

Apparently it is not, since by that argument you shouldn't have any problems with futuristic specs. If your target is current software and hardware, you have no other choice than to opt for text only. No justification, no hyphenation. And do all pocket readers even support italic text?

Typedog's picture

Know-it-all bird

Guerrizmo+Design
No man is an island unto himself_John Donne

bowerbird's picture

theunis said:
> My proposal was actually to do that beforehand.
> There is no reason to dismiss this idea whatsoever.

hey, look, i love the person who does what everyone else
says cannot be done. which is why i never say that myself.

if you can do it, great! do it. lead the way. clear the path.

all i'm saying is that no one has done it up to this point...

i don't think it's because they're all deaf, dumb, and blind.
i think it's because resources on handhelds are insufficient.

presently, there are three handheld platforms of substance:
the kindle, sony reader, and iphone. and only the iphone is
open for programmers. if you want to consider the g1 too,
or the blackberry, or palm's new operating system, go ahead.

there will also be a flock of e-ink hardware "coming soon",
plus some (like the iliad) which have already been around,
though they haven't established much of a foothold as yet.

so there'll be lots of platforms "soon", albeit not right now.

but most will be resource-challenged, when compared to
the desktop, so an ability to accomplish your objectives on
the desktop won't necessarily equate to handheld success...

but still, better to start _somewhere_ as soon as possible.

> That’s what I meant with ’tagging’ and ’preformatting’ —
> the more work is done in the file that goes _into_ your
> hypothetical reader, the less it has to do on its own.

you don't do yourself favors with the term "hypothetical".
e-book apps are real, right now, and used out in the wild...

as for preformatting, again, if you can make it work, do it.

my sense is that it won't help you that much, but perhaps
you can invent the preformatting that _will_ give a boost.
again, i won't ever tell you it can't be done.

but you will have to _do_it_, not just say it could be done.

another thing about "preformatting" is that you will have to
get buy-in agreement from the-world-at-large to do that task.
so it complicates things. still, you can worry about that later.

> Live hyphenation, for example, requires all texts
> tagged with the correct language. It’s no different than
> tagging text to be italic, or bold, or as quotes.

that's part of the "buy-in" that you'll have to get from creators.

you can _assume_ that they'll be willing to do it, but if they
balk, then your scheme won't succeed, even if it does "work".

welcome to the world of e-book application programming...

> Surely you don’t expect your (hypothetical?) program to
> handle that without supervision?

but i'm not the one with a hyphenation program. _you_ are!

and you're free to define the terms of how your program works,
and what it can and cannot handle, with or without supervision.

but if you require jobs of them, people might dismiss your system.

> So, take as much of the burden of “live” formatting _away_
> from your reading device and move it into the data file.

if you can make it work, fine, make it work.

> As per your example, if page numbers need to be known in
> advance (if your document contains references to absolute
> pages, e.g., a table of contents, an index, or internal references),
> you have no choice but to “fix” page boundaries.

well, just to be clear, that example didn't actually require that
"page numbers need to be known in advance". that is, when i
used "page 123", what i meant was "somewhere in the middle".
could have been "where chapter 6 starts", or "where this quote
is located", or "where this footnote was referenced" or whatever.

some e-book programs try to "finesse" the reality that they can't
attach a "page number" to a specific bit of text, so instead they
report that you are "62% of the way through the book", or some
such nonsense, or even just give the user some vague scroll-bar.

myself, i like my programs to be a lot more specific than that...
i like to say "you're on page 123 of 292 pages". bam! exactly!

but also note that -- when the person bumps up the fontsize,
or switches from landscape to portrait, or changes fonts --
then those 292 pages might morph into 336 pages, and then
the person would instead be "on page 135 out of 336 pages"...
you must repaginate the whole book. in real-time. on the fly.
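One way to picture the "page 123 of 292" becoming "page 135 of 336" effect is to remember the reader's place as a word offset and look the page up again after each repagination. This is only a sketch: `page_of` and both layouts below are hypothetical, and the page-start lists are assumed to come from whatever pagination routine the program uses.

```python
from bisect import bisect_right

def page_of(page_starts, word_index):
    """1-based number of the page containing word_index.

    page_starts is the sorted list of word offsets at which pages begin.
    """
    return bisect_right(page_starts, word_index)

# hypothetical layouts of the same text at two font sizes
old_layout = [0, 400, 800, 1200]               # 4 pages
new_layout = [0, 300, 600, 900, 1200, 1500]    # 6 pages after enlarging the font

anchor = old_layout[3]   # the reader is at the top of the old page 4 (word 1200)
old_page = page_of(old_layout, anchor)
new_page = page_of(new_layout, anchor)
```

Here the reader moves from page 4 of 4 to page 5 of 6 without losing their place. The lookup itself is cheap; the expensive part is producing `new_layout`, which still requires repaginating the whole book.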

> If you don’t you will _have_to_ devise a way to _generate_
> the page references when needed. It’s really one or the other.

right. exactly. you have to _generate_ and _update_ all of the
page references. and you have to do it on-the-fly, in real-time...

> You’ll have to define your reading target, though,
> as formatting of a novel is quite different than that of, say,
> a programmer’s guide to Perl (to name a random scientific text).

welcome to the world of e-book application programming.

nobody said it was going to be easy.

> I’m merely trying to fathom what you are hoping to accomplish.

i'm seeking feedback on my tests of some strategies to eliminate
loose lines obtained during justification _without_ hyphenation.

some people -- like you -- want to keep dragging hyphenation in...

> If your target is _current_ software and hardware,
> you have no other choice than to opt for text only.
> No justification, no hyphenation.
> And do all pocket readers even support italic text?

i wish i would have realized from the very beginning that you
have absolutely no experience with what you're talking about.

the kindle and the sony reader and _many_ iphone apps offer
justification. only a couple iphone apps offer any hyphenation,
and you wouldn't be all that impressed with their performance.

if you want to read more on these issues, take a look here:
> http://www.futureofthebook.org/blog/archives/2009/02/why_is_text_on_scre...

i'm not sure i've ever seen any e-book program that did _not_
offer styled text. even if there was one, readers wouldn't use it.

-bowerbird

Theunis de Jong's picture

(While I'm pondering your other points:)

> only a couple iphone apps offer any hyphenation,
> and you wouldn’t be all that impressed with their performance.

... is the reason I'm suggesting inserting hyphens into the data files. There is no reason at all the reader device should be able to hyphenate -- it should display text. Performance can be as low as economically possible.

And yes, I admit, it means data files should be prepared before reading. It means that you cannot download any Project Gutenberg text and automatically expect the best possible reading experience -- but I doubt people are realistically expecting this right now.

Or, for that matter, that it's even possible using fully automatic software; an example that springs to mind is automated translation of straight to curly quotes. See the mishaps of SmartyPants ('cause 'no we can't').

Nick Shinn's picture

> ...the mishaps of SmartyPants (’cause ’no we can’t’).

SmartyPants is configured at Typophile to work with North American quote marks/apostrophes, not the UK system:
('cause "no we can't").

Straight-to-curly quote software that worked with the UK system would require some kind of grammatical+dictionary intelligence to determine the difference between a single left quote and an apostrophe.
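To make the ambiguity concrete, here is a deliberately naive converter. It is not how SmartyPants works; `naive_curl` and its rules are assumptions chosen only to show the failure: any single quote at the start of a word is treated as an opening quote, so a leading apostrophe comes out wrong.

```python
import re

def naive_curl(text):
    """Convert straight quotes to curly ones using position alone."""
    # a double quote at the start, or after whitespace or an opening bracket, opens
    text = re.sub(r'(^|[\s(\[])"', "\\1\u201c", text)
    text = text.replace('"', "\u201d")          # every other double quote closes
    # same rule for single quotes: this is where leading apostrophes go wrong
    text = re.sub(r"(^|[\s(\[])'", "\\1\u2018", text)
    text = text.replace("'", "\u2019")          # the rest become right quotes
    return text

american = naive_curl("'cause \"no we can't\"")
british = naive_curl("('cause 'no we can't')")
```

In both results the apostrophe in front of "cause" is turned into a left single quote, and in the second one nothing in the character positions distinguishes it from the genuine opening quote before "no". Telling them apart needs knowledge of the words themselves.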

bowerbird's picture

theunis said:
> is the reason I’m suggesting inserting hyphens into the data files.

but, once again i say, that's not where the time-bottleneck is.

when you know exactly what word needs to be hyphenated,
it's not time-consuming to find the word in a look-up table.

(preinsertion might not save _any_ time, whole-picture-wise,
because you will need to delete all the unnecessary hyphens.)
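A small sketch of the two approaches being compared: looking break points up in a table when needed, versus pre-inserting soft hyphens (U+00AD) and stripping the unused ones later. The `BREAKS` table and both helper names are hypothetical; a real hyphenation dictionary would be far larger.

```python
SOFT_HYPHEN = "\u00ad"

# hypothetical lookup table: word -> offsets at which a break is allowed
BREAKS = {"hyphenation": [2, 6], "typography": [2, 5]}

def preinsert(text):
    """Insert a soft hyphen at every known break point of every word."""
    out = []
    for word in text.split(" "):
        offsets = BREAKS.get(word.lower(), [])
        # work right to left so earlier offsets stay valid
        for off in reversed(offsets):
            word = word[:off] + SOFT_HYPHEN + word[off:]
        out.append(word)
    return " ".join(out)

def strip_unused(line):
    """Remove the soft hyphens that were not used as a break."""
    return line.replace(SOFT_HYPHEN, "")
```

Both directions are a single pass over the text, which fits the point that neither the lookup nor the cleanup is where the time goes.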

the time-bottleneck comes from paginating the entire book...

note that your inexperience here is starting to test my patience;
you need working code if you want to make a convincing point.

> There is no reason at all the reader device should be able
> to hyphenate — it should display text.

again, if you want preformatting, then it gets even more hairy,
because you then need additional buy-in for the preformatting.

so, in addition to working code, you need charisma as well...

> And yes, I admit, it means data files should be prepared

right. and that has ramifications that go beyond working code.
but for now, it would be good enough to deliver working code...

> It means that you cannot download any Project Gutenberg text
> and automatically expect the best possible reading experience
> — but I doubt people are realistically expecting this right now.

that happens to be the exact goal i had when i started this work.

and i was expecting that preformatting would be a requirement.
(and -- since i was willing to do that preformatting by myself --
i had the necessary buy-in.)

and i found that the job _can_ be done without preformatting.

not with project gutenberg e-texts per se, because they have
tons of inconsistencies. but if they had been properly prepared
-- by which i mean they had merely followed their own rules --
i would indeed be able to download them and run them _as_is_.

so now, instead of "preformatting" them, i'm merely making them
_consistent_, which is a much smaller burden, largely automatic...

> Or, for that matter, that it’s even possible using fully automatic
> software; an example that springs to mind is automated translation
> of straight to curly quotes. See the mishaps of SmartyPants
> (’cause ’no we can’t’).

smartypants isn't quite as smart as it could be.

but that's all beside the point, because even with preformatting,
you can't get the performance you need if you do hyphenation...

or maybe _you_ can, but you'll need to show me the working code.

-bowerbird

bowerbird's picture

here's one that i seem to have forgotten to post.

and since it's _relevant_ to the actual _topic_ here...

it's a reply to the comment from theunis where he
gave sample output from his hyphenation routine,
the comment dated 14 march, timestamp 5:57pm.
it's the comment that contains the little screenshot.

***

oh-oh. better watch out. :+)

if you think i can talk a lot at a _general_ level,
don't get me started with _specific_ examples...

below are 6 different takes, using _your_ text...

no hyphenation, just different breaks of the lines.
and, for a few, some extremely slight re-writes...

you might consider rewrites to be "unfair". if so,
then your point boils down to the statement that,
for any piece of text, wrapping it to various widths
might lead to loose lines for some of those widths.

but nobody is arguing that statement. _nobody._
least of all _me_. but really, nobody in the world.
because it's bloody obvious. all you have to do is
imagine constructions consisting of numerous
consecutive multi-syllabic sequential monstrosities
masquerading as meaningful meanderings, without
introducing many shorter words therein so as to give
the break routines some flexibility with which to work.
heck, most of the time, three (and sometimes just two)
long words in a row is enough to cause you problems.
(or even one word, if it's _ubersetzenseehafenstadt_.)
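The effect of long words on justified lines can be measured with a short sketch. Both functions are illustrative assumptions: the wrap is a plain first-fit on character counts, and "looseness" is simply the average gap width a line would need in order to fill the measure.

```python
def greedy_wrap(text, width):
    """First-fit wrap; returns each line as a list of words."""
    lines, current = [], []
    length = 0
    for word in text.split():
        if current and length + 1 + len(word) > width:
            lines.append(current)
            current, length = [], 0
        length += len(word) if not current else 1 + len(word)
        current.append(word)
    if current:
        lines.append(current)
    return lines

def looseness(line, width):
    """Average inter-word gap, in character cells, once the line is justified.

    1.0 is a normal space; larger values mean a looser line.
    """
    gaps = len(line) - 1
    if gaps == 0:
        return float("inf")   # a lone word cannot be stretched to the measure
    letters = sum(len(w) for w in line)
    return (width - letters) / gaps
```

With `greedy_wrap("aa bb cccccc dd", 6)` the long word is pushed onto its own line, and the line before it (`["aa", "bb"]`) needs a gap of 2.0 cells instead of 1.0 to reach the right margin.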

so if one loose line on a page, or even two or three,
is enough to upset your frail sensibilities, then yes,
you will undoubtedly have to resort to hyphenation,
if you want justification, or use ragged-right instead.

but the people who _want_ justification will generally
be willing to put up with _some_ loose lines to get it,
if that's what it takes. and they'll likely prefer _that_
over the other option -- long waits for hyphenation...

in addition, if you also give them a convenient toggle
for the justification, thereby allowing a quick and easy
switch-over to ragged-right when the loose lines on
a specific page get to be too much for them to handle,
with a quick-and-easy switch-back on the next page,
i think you've got the solution they will come to prefer.
(even if it was not "optimal" in some theoretical sense.)

it's a balancing act...

***

also, getting back to your example, under my rules,
there's only _one_ set of breaks that _must_ be tight
-- the author's linebreaks on the "canonical" version.

in other words, in my workflow, the _author_ does the
line-breaks, so thus is right there to rewrite if needed.

(the publisher has been disintermediated, thank god.)

so if the linebreaks on the "canonical" version are tight,
that's all readers expect. they know full well that if they
rewrap to some other width, there might be loose lines.
that's life.

-bowerbird

p.s. some possible reworkings of your paragraph...

he was born graf heinrich karl wilhelm otto friedrich
von ubersetzenseehafenstadt, but changed his name
to nigel st. john gloamthorpby, a.k.a. lord woadmire,
in 1914. in his photograph, he looks every inch a von
ubersetzenseehafenstadt, and he is entirely free of the
cranial geometry problem so evident in the older portraits.
lord woadmire is not related to the original ducal line of
qwghlm, the moore family (anglicized from the qwghlmian
clan name mnyhrrgh) which had been terminated in 1888
by a spectacularly improbable combination of schistosomiasis,
suicide, long-festering crimean war wounds, ball lightning,
flawed cannon, falls from horses, improperly canned oysters,
and rogue waves.

he was born graf heinrich karl wilhelm otto friedrich
von ubersetzenseehafenstadt, but changed his name
to nigel st. john gloamthorpby, a.k.a. lord woadmire,
in 1914. in his photograph, he looks every inch a von
ubersetzenseehafenstadt, and he is entirely free of the
cranial geometry problem so evident in the older portraits.
lord woadmire is not related to the original ducal line of
qwghlm, the moore family (anglicized from the qwghlmian
clan name mnyhrrgh) which was terminated in 1888 by a
spectacularly improbable combination of schistosomiasis,
suicide, long-festering crimean war wounds, ball lightning,
flawed cannon, falls from horses, improperly canned oysters,
and rogue waves.

he was born graf heinrich karl wilhelm otto friedrich von
ubersetzenseehafenstadt, but changed his name to nigel
st. john gloamthorpby, a.k.a. lord woadmire, in 1914. in his
photo, he looks every inch a von ubersetzenseehafenstadt,
and he is entirely free of the cranial geometry problem so
evident in the older portraits. lord woadmire is not related
to the original ducal line of qwghlm, the moore family
(anglicized from the qwghlmian clan name mnyhrrgh) which
had been terminated in 1888 by a spectacularly improbable
combination of schistosomiasis, suicide, long-festering
crimean war wounds, ball lightning, flawed cannon, falls
from horses, improperly canned oysters, and rogue waves.

he was born graf heinrich karl wilhelm otto friedrich von
ubersetzenseehafenstadt, but changed his name to nigel
st. john gloamthorpby, a.k.a. lord woadmire, in 1914. in his
photo, he looks every inch a von ubersetzenseehafenstadt,
and he is entirely free of the cranial geometry problem so
evident in the older portraits. lord woadmire is not related
to the original ducal line of qwghlm, the moore family
(anglicized from the qwghlmian clan name mnyhrrgh)
which had been terminated in 1888 by a spectacularly
improbable combination of schistosomiasis, suicide,
long-festering, crimean war wounds, ball lightning, flawed
cannon, falls from horses, improperly canned oysters, and
rogue waves.

he was born graf heinrich karl wilhelm otto friedrich von
ubersetzenseehafenstadt, but changed his name to nigel
st. john gloamthorpby, a.k.a. lord woadmire, in 1914. in his
photo, he looks every inch a von ubersetzenseehafenstadt,
and he is entirely free of the cranial geometry problem so
evident in the older portraits. lord woadmire is not related
to the original ducal line of qwghlm, the moore family
(anglicized from the qwghlmian clan name mnyhrrgh)
which had been terminated in 1888 by a spectacularly
improbable combination of schistosomiasis, suicide,
long-festering, crimean war wounds, ball lightning,
flawed cannon, falls from horses, improperly canned
oysters, and rogue waves.

he was born graf heinrich karl wilhelm otto friedrich
von ubersetzenseehafenstadt, but changed his name
to nigel st. john gloamthorpby, a.k.a. lord woadmire,
in 1914. in his photograph, he looks every inch a von
ubersetzenseehafenstadt, and he is entirely free of
the cranial geometry problem so evident in the older
portraits. lord woadmire is not related to the original
ducal line of qwghlm, the moore family (anglicized
from the qwghlmian clan name mnyhrrgh) which had
been terminated in 1888 by a spectacularly improbable
combination of schistosomiasis, suicide, long-festering,
crimean war wounds, ball lightning, flawed cannon, falls
from horses, improperly canned oysters, and rogue waves.

Theunis de Jong's picture

Well, I'm working on it :-)

Nick:
> ...some kind of grammatical+dictionary intelligence to determine the difference between a single left quote and an apostrophe...

Sure. I tried to come up with a sentence containing both 'cause and cause to illustrate that. (No luck, on such a short run -- sorry, I'm not a native. Couldn't find a good use for 'twixt either.)

As for bowerbird (apologies for changing the order around), returning to hyphenation code just this once more:

> ..you need working code if you want to make a convincing point...

preceded by

> .. preinsertion might not save _any_ time, whole-picture-wise, because you will need to delete all the unnecessary hyphens.

My code--existing on my hard disk and all--does find breaking points. A test routine actually inserts them--so I can verify if they're correct. However: inserting them physically into the text, checking lengths, then removing extraneous ones before blitting to screen, is totally redundant. The breaking code returns breaking points, offset from the start of the word. The paragraph breaking routine expects lengths and widths of words and/or fragments; not the actual words. Only when drawn to the screen, the characters are called up from memory. This separates 'formatting' (= line breaking) nicely from 'drawing'.
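The separation described above might look roughly like this. It is a sketch under stated assumptions, not the code on anyone's hard disk: the names `measure`, `break_lines`, and `draw` are made up, widths are character counts times a fixed `char_width`, and hyphenation fragments are left out so that only whole words are measured.

```python
from typing import List, Tuple

Fragment = Tuple[int, int, int]   # (start offset in text, end offset, width)

def measure(text: str, char_width: int = 1) -> List[Fragment]:
    """Record where each word sits in the text and how wide it is."""
    frags, pos = [], 0
    for word in text.split(" "):
        if word:
            frags.append((pos, pos + len(word), len(word) * char_width))
        pos += len(word) + 1
    return frags

def break_lines(frags: List[Fragment], line_width: int, space_width: int = 1):
    """Formatting: a greedy breaker that sees widths only, never characters.

    Returns (first, last + 1) index ranges into frags, one per line.
    """
    lines, first, used = [], 0, 0
    for i, (_, _, w) in enumerate(frags):
        extra = w if i == first else space_width + w
        if i > first and used + extra > line_width:
            lines.append((first, i))
            first, used = i, w
        else:
            used += extra
    if first < len(frags):
        lines.append((first, len(frags)))
    return lines

def draw(text: str, frags: List[Fragment], lines) -> List[str]:
    """Drawing: only here are the characters fetched from the text."""
    return [text[frags[a][0]:frags[b - 1][1]] for a, b in lines]
```

With `text = "the quick brown fox"`, `draw(text, measure(text), break_lines(measure(text), 10))` gives `["the quick", "brown fox"]`. The breaker could be handed hyphenation fragments in the same form without changing its interface.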

The good news is I'm taking the same approach as you are on the un-/badly formatted Gutenberg texts: preprocessing them makes it all a bit tidier.
But it seems our targets diverge from there. Oh well.

If you're interested in 'the very best line breaking' for any width, you should (indeed, perhaps you already have) take a look at Knuth/Plass line breaking. A good introduction for me was this primer text. The example shown uses hyphenation, but the general algorithm works just as well without, which is a pretty important point for you. You can't get any better than that.
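For readers who have not seen the idea, here is a much-reduced illustration of "total fit" line breaking. It is not the Knuth/Plass algorithm itself: there is no stretchable glue, no penalties, and no hyphenation, and the cost (squared leftover space per line, last line free) is an assumption picked for brevity. What it shares with Knuth/Plass is that breaks are chosen for the paragraph as a whole instead of one line at a time.

```python
def total_fit(words, width):
    """Choose breaks that minimise the sum of squared trailing space.

    Widths are character counts; the last line costs nothing.
    """
    n = len(words)
    best = [float("inf")] * (n + 1)   # best[i]: minimal cost of setting words[i:]
    nxt = [n] * (n + 1)               # nxt[i]: index where the following line starts
    best[n] = 0
    for i in range(n - 1, -1, -1):
        length = -1
        for j in range(i, n):
            length += 1 + len(words[j])
            if length > width and j > i:
                break                  # words[i..j] no longer fit on one line
            slack = max(width - length, 0)
            cost = 0 if j == n - 1 else slack * slack
            if cost + best[j + 1] < best[i]:
                best[i] = cost + best[j + 1]
                nxt[i] = j + 1
    lines, i = [], 0
    while i < n:
        lines.append(" ".join(words[i:nxt[i]]))
        i = nxt[i]
    return lines
```

`total_fit("aaa bb cc ddddd".split(), 6)` returns `["aaa", "bb cc", "ddddd"]`, whereas a first-fit wrap would produce `"aaa bb"`, `"cc"`, `"ddddd"` and leave the second line much shorter than the others. The price is the nested loop over candidate breaks for every paragraph.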

BruceS63's picture

"Brevity is the soul of wit." Not to mention, intelligence.

-------------------------------------
Inventor of the DVD rewinder

Theunis de Jong's picture

Such a counterpart shall fame his wit.

I'd suggest "Thine Enmity's most Capital" -- perhaps looking better in all lowercase.

bowerbird's picture

theunis said:
> Well, I’m working on it :-)

and, as i said from the outset, i look forward to seeing it. :+)

***

> I tried to come up with a sentence containing
> both _’cause_ and _cause_ to illustrate that.

the occasional exception won't really invalidate results that
are completely correct in the vast majority of cases, though.

if you want to discuss curly-quote conversion, start a thread,
and i'll be happy to share the results of my experimentation.

***

> My code—existing on my hard disk and all— _does_
> find breaking points. A test routine actually inserts them
> — so I can verify if they’re correct.

just a quick note here to say that i don't need to _see_ your code.
in fact, if it's novel at all, i would suggest you keep it under wraps.

i just need to see it in action. compile a program that i can run.
i'm on mac, but i'll find a windows box if that's all you can make.

> _However:_ inserting them physically into the text,
> checking lengths, then removing extraneous ones
> before blitting to screen, is totally redundant.

well, i would've sworn you mentioned preinsertion up above.
but if you now contend "preformatting" involves something
different than that, i will accept that. you can set whatever
conditions you want. whether people _buy_ them or not is a
different story, but for our purposes here, whatever you say.

> The breaking code returns breaking points, offset from
> the start of the word. The paragraph breaking routine
> expects lengths and widths of words and/or fragments;
> not the actual _words._

ok, that's another way that people have attacked this issue, yes.

the problem with that, as you might or might not know, is that
in order to know "the lengths and widths" of those fragments,
you need their font and fontsize. but one important aspect of
electronic-books is that the end-user sets the font and the size.
the user also sets the size of the viewport, another consideration.
in the more sophisticated programs, the user might also control
other aspects that affect layout, such as the position and size of
pictures and other graphic elements (e.g., space around callouts).

some e-book programs try to "finesse" this problem by giving
the user a small number of font-sizes -- the kindle gives you 4,
or maybe it's 6, if memory serves correctly, and they only offer
1 or 2 fonts -- so they can compute the widths for all the sizes.

but that kind of constraint will _not_ be tenable in the long run.
e-book end-users _expect_ that they can use any font they want,
and they _expect_ that they can make that font any size they like.

(and yeah, you can scale the numbers, but then you've moved to
double-math instead of integer-math, and that means it's slower.)
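For reference, scaling stored widths to a user-chosen size does not strictly require floating point; a common alternative is fixed-point arithmetic. The sketch below assumes widths are kept in 1/1000 em and results are wanted in 1/64 pixel; `ADVANCE`, `UNITS_PER_EM`, and both function names are hypothetical. Whether this is fast enough on a given handheld is not something the sketch can show.

```python
UNITS_PER_EM = 1000   # assumption: advance widths are stored in 1/1000 em

# hypothetical advance widths for one font, in font units
ADVANCE = {"a": 500, "b": 550, "i": 250, "m": 800, " ": 250}

def word_width_units(word):
    """Size-independent width; can be computed once per font and cached."""
    return sum(ADVANCE.get(ch, 500) for ch in word)

def to_pixels_64(units, font_px):
    """Scale font units to 1/64-pixel fixed point using integers only."""
    return (units * font_px * 64 + UNITS_PER_EM // 2) // UNITS_PER_EM

width_units = word_width_units("mab")          # 800 + 500 + 550
width_at_12px = to_pixels_64(width_units, 12)  # in 1/64 pixel
```

Here `width_units` is 1850 and `width_at_12px` is 1421, i.e. about 22.2 pixels. Changing the font size only repeats the cheap scaling step; changing the font itself still means measuring every word again.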

> If you’re interested in ’the very best line breaking’ for any width

um, no, that's the very thing we've been discussing all along...

the situation has scarce resources and a real-time requirement,
so i'm looking for the best _possible_ line-break routines that
_meet_ those requirements. it's all about living with constraints.

> If you’re interested in ’the very best line breaking’ for any width,
> you should (indeed, perhaps you already have) take a look at
> Knuth/Plass line breaking.

i've looked at it. i've studied it. it cannot meet the requirements,
as far as i can see... so i use simpler routines that give me breaks
that are _almost_ as good, in _most_ situations, and i move on...

> The example shown uses hyphenation, but the general algorithm
> works just as well without, which is a pretty important point for you.
> You can’t get any better than that.

no, but you can get _almost_ as good, and do the job a _lot_ faster,
which happens to be the superior solution in this particular situation.

for instance, was there anything wrong with the breaks i gave above,
on your actual text? the rag looked pretty good, at least to my eyes.

-bowerbird

paragraph's picture

If there ever comes a time
to pluck the bowerbird,
it ought to be done well,
so it stays plucked.

joeclark's picture
  • CSS controls whitespace, among many other things, as you should know by now.
  • No matter how you do it, you’re going to have to improve your HTML and PDF semantics, even if that means throwing out your existing infrastructure and starting from scratch.
  • html lang="en-US" (or any ISO language code optionally with extension)
  • I do too much volunteer work as it is.

We would appreciate it if you stopped using so many hard returns in your postings here. Two returns after a paragraph are more than enough.


Joe Clark
http://joeclark.org/

bowerbird's picture

joe said:
> No matter how you do it, you’re going to
> have to improve your HTML and PDF semantics,
> even if that means throwing out your existing infrastructure

as i said, it's simply a matter of rewriting the template,
and when my users do that task, i'll happily update it...

because if i don't have any users who _want_ this done,
or any users at all, there's no need for me to do it myself.

i suspect, however, that when you say i "have to" do this,
what you're really saying is that if i want my output to be
accessible, that's what i'd "have to" do.

my answer is that i _do_ want my output to be accessible,
but that accessibility will be a product of my z.m.l. format
with its viewer-program, _not_ the .html or .pdf formats...

as far as i'm concerned, .html and .pdf are legacy formats;
they'll be replaced by some light-markup format like mine.
but this, of course, is another conversation for another day.

> I do too much volunteer work as it is.

thanks for your feedback...

-bowerbird
