aozoraconvert

v2.0.0-alpha (latest) · Published: Jul 31, 2025 · License: AGPL-3.0 · Imports: 24 · Imported by: 2

Repository

github.com/adamay909/AozoraConvert

README ¶

AozoraConvert

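What is this?

A library and command-line tool that extracts an abstract syntax tree (AST) from files written with Aozora Bunko annotations and serializes it into various formats. Output currently supports LaTeX, HTML (plus Epub and AZW3, which are built on it), Aozora Bunko-style text files, and JSON. Because everything goes through the AST, adding other serialization formats is relatively easy.

It is written in Go, a compiled language, so it is fast. Conversion to HTML, for example, runs 20 to 30 times faster than Aozora2Html from AozoraHack, and conversion to the other formats is about as fast. Even when run in the browser via WASM, the conversion time is barely noticeable. aozora.orihasam.com takes advantage of this. It is a web app, but AST extraction and conversion to each format all happen locally in the browser, so it behaves like a lightweight desktop GUI app.

This README uses the command-line tool to explain the characteristics of the conversion, and ends with a short explanation of how to use aozora.orihasam.com.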

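Installation

Go must be installed. The development environment is Go 1.24, but somewhat older versions should probably work too.

Change to the cmd/aozoraConvert folder and run

go install

to complete the installation.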

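Basic usage of the command-line tool

aozoraConvert is explained below through usage examples. Input and output file names are written out with their extensions because the extensions matter. Since the options overlap between formats, they are explained in most detail for LaTeX output and much more briefly afterwards.

Input files may be encoded in either Shift_JIS or UTF-8. The tool first checks whether the input is UTF-8; if it is not, it assumes Shift_JIS and carries on.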
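The detection rule just described can be sketched with the standard library alone. This is an illustration, not the tool's actual code: guessEncoding is a hypothetical helper, and the Shift_JIS decoding step itself is left out because it needs a package outside the standard library.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// guessEncoding is a hypothetical helper, not part of aozoraconvert.
// It mirrors the rule above: input that is valid UTF-8 is treated as
// UTF-8, and anything else is assumed to be Shift_JIS.
func guessEncoding(data []byte) string {
	if utf8.Valid(data) {
		return "utf-8"
	}
	return "shift_jis"
}

func main() {
	utf8Text := []byte("青空文庫")
	// The same first two characters encoded as Shift_JIS; these bytes
	// are not valid UTF-8.
	sjisText := []byte{0x90, 0xC2, 0x8B, 0xF3}

	fmt.Println(guessEncoding(utf8Text))
	fmt.Println(guessEncoding(sjisText))
}
```

Pure ASCII input is valid UTF-8 and also decodes identically as Shift_JIS for most characters, so checking UTF-8 first is a reasonable order.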

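Basic information

aozoraConvert [options] input-file

  -format string
    	Output file format. Takes precedence over the extension of the output file name. Possible formats are tex, txt, html, json, epub, azw3
  -fragment
    	Treat the input file as a fragment of an Aozora Bunko format file.
  -full
    	Output a complete file including the header or preamble.
  -jis0208
    	Do not replace gaiji.
  -jis0213
    	Limit gaiji replacement to the JIS0213 range.
  -o string
    	Output file name. stdout if not specified. (default "/dev/stdout")
  -rubyForEmph
    	For HTML-family output, render emphasis marks as ruby instead of CSS text-emphasis. (default true)
  -sjis
    	Encode the output file as Shift_JIS. The default is utf-8.
  -strict
    	Do not perform automatic corrections.
  -supportFiles
    	Also output support files.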
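For readers who want to wrap the tool or mirror its interface, the option list above maps naturally onto Go's standard flag package. The following is only a sketch: the options struct, its field names, and parseOptions are hypothetical and not taken from the aozoraConvert source. The flag names and defaults come from the list above, and the rule that -sjis implies -jis0208 comes from the Shift_JIS section of this README.

```go
package main

import (
	"flag"
	"fmt"
)

// options is a hypothetical container for this sketch.
type options struct {
	format       string
	out          string
	fragment     bool
	full         bool
	jis0208      bool
	jis0213      bool
	rubyForEmph  bool
	sjis         bool
	strict       bool
	supportFiles bool
}

// parseOptions parses args and returns the options plus the remaining
// positional arguments (the input file name).
func parseOptions(args []string) (options, []string, error) {
	var o options
	fs := flag.NewFlagSet("aozoraConvert", flag.ContinueOnError)
	fs.StringVar(&o.format, "format", "", "tex, txt, html, json, epub or azw3")
	fs.StringVar(&o.out, "o", "/dev/stdout", "output file name")
	fs.BoolVar(&o.fragment, "fragment", false, "treat input as a fragment")
	fs.BoolVar(&o.full, "full", false, "output a complete file")
	fs.BoolVar(&o.jis0208, "jis0208", false, "do not replace gaiji")
	fs.BoolVar(&o.jis0213, "jis0213", false, "limit gaiji replacement to JIS0213")
	fs.BoolVar(&o.rubyForEmph, "rubyForEmph", true, "emphasis marks as ruby")
	fs.BoolVar(&o.sjis, "sjis", false, "encode output as Shift_JIS")
	fs.BoolVar(&o.strict, "strict", false, "no automatic corrections")
	fs.BoolVar(&o.supportFiles, "supportFiles", false, "also output support files")
	if err := fs.Parse(args); err != nil {
		return o, nil, err
	}
	if o.sjis {
		o.jis0208 = true // the README states that -sjis forces -jis0208
	}
	return o, fs.Args(), nil
}

func main() {
	o, rest, err := parseOptions([]string{"-sjis", "-full", "-o", "out.tex", "in.txt"})
	if err != nil {
		fmt.Println(err)
		return
	}
	fmt.Printf("%+v %v\n", o, rest)
}
```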

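Conversion to LaTeX

With the one exception described below, every annotation in the Aozora Bunko annotation list is supported.

Most of the commands used in the generated LaTeX file carry the prefix azconv. Their definitions are in azcommands.tex, which can be output alongside the main file.

The output is designed to be compiled with LuaLaTeX, but as long as the text contains no kanbun (classical Chinese), rewriting it for upLaTeX should not be very difficult.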

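Converting a text file to compilable LaTeX

aozoraConvert -supportFiles -full -o output.tex input.txt

The input file is a text file with Aozora Bunko annotations. The output file name is arbitrary as long as its extension is .tex. The -supportFiles option also outputs azcommands.tex, which defines packages, custom commands, and so on. If you do not need it, omit the option and only the LaTeX conversion of the text file is output.

If the output file is hoge.tex and azcommands.tex is in the same folder (it will be if you used -supportFiles), then

lualatex hoge.tex

produces a PDF file. Unless the text contains warichu (inline split notes), a single compilation pass should suffice. You may still need to make manual fixes after checking the compile log and the PDF; see below.

By default the layout targets B6 (JIS) paper with a text block of roughly 40 × 14 characters. This can of course be changed freely.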

If you want an extension other than .tex

If for some reason you cannot use the .tex extension, specify the output format with -format:

aozoraConvert -format tex -supportFiles -full -o output.hoge input.txt

This produces LaTeX regardless of the extension.

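Using zip files obtained from Aozora Bunko

Zip files downloaded from Aozora Bunko can also be used as input:

aozoraConvert -supportFiles -full -o output.tex input.zip

In this case the image files contained in the zip file are output as well. If you do not need them, omit -supportFiles.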

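Limiting the output character range to JIS0208 or JIS0213

By default every gaiji annotation is replaced with a Unicode character wherever possible. Depending on the fonts you use and other factors, this may be inconvenient. In that case

aozoraConvert -jis0208 -supportFiles -full -o output.tex input.txt

leaves the gaiji annotations as they are. To limit gaiji replacement to the JIS0213 range, use

aozoraConvert -jis0213 -supportFiles -full -o output.tex input.txt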

Encoding the output file as Shift_JIS

To get Shift_JIS output instead of utf-8, use

aozoraConvert -sjis -supportFiles -full -o output.tex input.txt

In this case the -jis0208 option is forcibly selected as well.

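Outputting a LaTeX fragment

When the LaTeX preamble would get in the way, for example when inserting the output into a template,

aozoraConvert -o output.tex input.txt

outputs a LaTeX fragment without the preamble, \begin{document}, or \end{document}. The -jis0208, -jis0213, -supportFiles, and -format options can also be used.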

Output to stdout

To send the output to stdout, for example to pipe it to another tool, do not specify an output file name:

aozoraConvert -format tex input.txt

In this case you must specify LaTeX output with -format (otherwise HTML is output).
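The selection rules described so far (-format wins over the extension, the extension decides otherwise, and HTML is the result when writing to stdout without -format) can be sketched as follows. chooseFormat is a hypothetical function written for this illustration; in particular, falling back to HTML for an unrecognized extension is an assumption, since the README only documents the stdout case.

```go
package main

import (
	"fmt"
	"path/filepath"
	"strings"
)

// chooseFormat is a hypothetical helper, not part of aozoraConvert.
// formatFlag is the value of -format ("" if unset) and outPath is the
// value of -o.
func chooseFormat(formatFlag, outPath string) string {
	if formatFlag != "" {
		return formatFlag // -format takes precedence over the extension
	}
	ext := strings.TrimPrefix(strings.ToLower(filepath.Ext(outPath)), ".")
	switch ext {
	case "tex", "txt", "html", "json", "epub", "azw3":
		return ext
	}
	// Documented for stdout; assumed here for any other unknown extension.
	return "html"
}

func main() {
	fmt.Println(chooseFormat("", "out.tex"))
	fmt.Println(chooseFormat("tex", "out.hoge"))
	fmt.Println(chooseFormat("", "/dev/stdout"))
}
```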

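LaTeX output issues that may require manual fixes

The pxrubrica package is used for ruby, but compilation fails when the ruby or its base text is too long, so long ruby and base text are split during conversion to LaTeX. The split is mechanical, so the result may not look good.

Illustrations are made to fit within the page, but their size and placement may need manual adjustment. In particular, images with captions are converted to figure, so you need to check that they end up in a sensible place when typeset.

When kanbun is detected, the kanbun package is used. When detection succeeds the result looks good, but the current detection can miss kanbun, and in that case the output may look poor.

Characters outside the JIS0213 range may have no glyph in the font. The compile messages will tell you. As a workaround, a commented-out example at the end of azcommands.tex shows how to use the unicodechar package to switch only the problematic characters to another font. This approach requires that you can type the characters in question directly.

LaTeX boxes cannot break across lines, so the layout may occasionally look wrong.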

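When the input file is a fragment

By default the input file must be a complete Aozora Bunko text file with metadata such as the title and author at the top. If the input has no metadata (for example, when you want to feed in examples from the annotation list to check behavior), use

aozoraConvert -fragment -o output.tex input.txt

Adding the -full option also produces a compilable file.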

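What cannot be done in LaTeX (for now)

No package seems to support ruby and side lines (bousen) together, so the side line is dropped. More precisely, if the LaTeX command for side lines is \bousen, what does not work is

\bousen\ruby{...}{...}

By contrast,

\ruby{\bousen{...}}{...}

works without problems. Whether this can be used to combine ruby and side lines is under consideration.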

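Conversion to Aozora Bunko-style text files

This may seem unnecessary, but it is useful for checking the parser's behavior.

Outputting an Aozora Bunko annotated file

aozoraConvert -full -o output.txt input.txt

outputs an Aozora Bunko-style annotated text file (the .txt extension on the output file tells the tool that a text file is wanted).

The output uses Aozora Bunko annotations, but by default the encoding is UTF-8 and any Unicode character may appear, so it is not a proper Aozora Bunko text. It is Aozora Bunko style only.

The output can of course be used as input to aozoraConvert.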

Using the -format option

Use -format to write to stdout, or to convert to a text file regardless of the output file's extension. For example, to write to stdout:

aozoraConvert -format txt input.txt

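Output in JIS0208 and Shift_JIS

Specifying -sjis restricts the output to Shift_JIS and JIS0208:

aozoraConvert -sjis -o output.txt input.txt

Even so, the output differs from the original text file in the following ways:

Forward-reference annotations are not used; everything takes the form ［＃opening annotation］…［＃closing annotation］.
For a single indented line, the form that puts only ［＃ｎ字下げ］ at the start of the line is not used either; it becomes ［＃ここからｎ字下げ］…［＃ここで字下げ終わり］.
The same applies to single-line bottom alignment (地付き) and raising from the bottom (字上げ).
The ｜ that marks the extent of the ruby base text is always inserted.
There is no explanatory note about annotations at the top of the file.

Despite these differences, aozora2html converts the output to XHTML without problems, and the result is identical to the conversion of the original Aozora Bunko text except for the following:

［＃ｎ字下げ］… and ［＃ここからｎ字下げ］…［＃ここで字下げ終わり］ mean the same thing, but aozora2html converts the former to

<div class="jisage_3" style="margin-left: 3em">...</div>

and the latter to

<div class="jisage_3" style="margin-left: 3em">
...<br />
</div>

with an extra <br />. We regard this as a bug in aozora2html. The same thing happens with bottom alignment and raising from the bottom. Apart from this difference, running aozora2html on AozoraConvert's text output gives the same result as running it on the original Aozora Bunko text, as long as the original text has no problems.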

HTML

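The result resembles the XHTML available from Aozora Bunko, but it is HTML, not XHTML. A complete HTML file including the head element, together with the CSS file, can be output with:

aozoraConvert -full -supportFiles -o output.html input.txt

The CSS file is named aozora.css. If the input is a zip file provided by Aozora Bunko, the image files it contains are output as well.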

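Creating an HTML fragment

To embed the conversion result in a template or similar, do not use the -full option:

aozoraConvert -o output.html input.txt

outputs an HTML fragment.

Zip file input

Zip files provided by Aozora Bunko can also be used as input. In this case -supportFiles outputs the image files contained in the zip file in addition to aozora.css.

Output to stdout

If no output file name is specified, the output goes to stdout. In this case HTML is output even without specifying the output type with -format html.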

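The jis0208 and jis0213 options

For HTML output too, the -jis0208 and -jis0213 options control the character set used in the output file.

Caveats

HTML and CSS themselves support vertical writing quite well, but support in browser rendering engines is still uneven.

Where the center of a vertical line is placed seems to differ between rendering engines. As a result, warichu look slightly different in Gecko (Firefox and derivatives) and Chromium (Chrome, Edge, and so on).
CSS can add emphasis marks with text-emphasis, but in Chromium-based renderers the marks seem too far from the base characters, so by default emphasis marks are handled as ruby. To use text-emphasis, specify -rubyForEmph=false.
Gecko is fairly solid, but the orientation of brackets and similar characters goes wrong in cases such as when they enclose Western text.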

JSON

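Serializes the AST as JSON. This can be useful for seeing how the structure of the text has been understood, or for continuing processing in a language other than Go:

aozoraConvert -o output.json input.txt

The -jis0208 and related options can be used here as well.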

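E-books

The tool can also create e-books in the standard Epub3 format and for Kindle. In this case the input must be a zip file (either one provided by Aozora Bunko or one you made yourself that contains the necessary image files).

Epub3

To output Epub3:

aozoraConvert -o output.epub input.zip

Some readers cannot display characters beyond the JIS0213 range. In that case

aozoraConvert -jis0213 -o output.epub input.zip

takes care of it.

For Kindle (AZW3 format)

Kindle e-books use the AZW3 format:

aozoraConvert -o output.azw3 input.zip

Kindle does seem able to display characters beyond the JIS0213 range, but at least on the Paperwhite at hand the font can end up quite ugly. If that bothers you, use -jis0213 to limit gaiji replacement to the JIS0213 range.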

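About the title and author name of the text

Aozora Bunko text files give the title, author, and so on in the first few lines. When there is only a title and a single author this poses no problem, but a subtitle, multiple authors, or translators make it hard to decide mechanically. You therefore always need to check that the subtitle, author names, and so on in the output are correct (the title is always right).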

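About automatic correction

Aozora Bunko text files occasionally contain malformed annotations. aozoraConvert attempts to correct the most common of these problems automatically. Corrections that are made automatically and are not expected to cause problems:

Remove extra whitespace at the end of an annotation.
Correct 「おわり」 and 「終り」 inside annotations to 「終わり」.
Correct ［＃字下げ終わり］ to ［＃ここで字下げ終わり］.
Correct ［＃ここから割り注］ and ［＃ここで割り注終わり］ to ［＃割り注］ and ［＃割り注終わり］.
Correct 「見出」 inside annotations to 「見出し」.
Insert a line break where a block-level opening or closing annotation is not preceded or followed by one.
Replace special symbols such as ［, ］, ＃, and ｜ with gaiji annotations.
Remove ruby inside forward-reference annotations.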
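To make the flavor of these corrections concrete, here is a small sketch covering three of them: trimming trailing whitespace inside an annotation, normalizing 「おわり」 and 「終り」 to 「終わり」, and rewriting ［＃字下げ終わり］. It is not the tool's implementation; fixAnnotations and the regular expression are assumptions made for this illustration, and the remaining corrections in the list are not attempted.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// annotation matches one Aozora Bunko annotation of the form ［＃...］.
var annotation = regexp.MustCompile(`［＃[^］]*］`)

// fixAnnotations is a hypothetical helper. Text outside annotations is
// left untouched.
func fixAnnotations(text string) string {
	text = annotation.ReplaceAllStringFunc(text, func(a string) string {
		inner := strings.TrimSuffix(strings.TrimPrefix(a, "［＃"), "］")
		// Drop trailing ASCII and full-width whitespace.
		inner = strings.TrimRight(inner, " \t　")
		// Normalize the spelling of "end".
		inner = strings.ReplaceAll(inner, "おわり", "終わり")
		inner = strings.ReplaceAll(inner, "終り", "終わり")
		return "［＃" + inner + "］"
	})
	// The block-closing annotation needs the ここで prefix.
	return strings.ReplaceAll(text, "［＃字下げ終わり］", "［＃ここで字下げ終わり］")
}

func main() {
	fmt.Println(fixAnnotations("［＃字下げおわり ］"))
	fmt.Println(fixAnnotations("［＃ここで字下げ終り］"))
	fmt.Println(fixAnnotations("物語のおわり"))
}
```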

Important: the Aozora Bunko annotation specification does not allow nested indentation, but a fair number of texts appear to use it. For example, given

［＃ここから２字下げ］
・・・
［＃１字下げ］ｘｘｘ
・・・
［＃ここで字下げ終わり］

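the natural reading is that the 1-character indent line is meant to be indented one further character inside the 2-character indent, that is, 3 characters in total. Written according to the annotation specification, it becomes: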

［＃ここから２字下げ］
・・・
［＃３字下げ］ｘｘｘ
・・・
［＃ここで字下げ終わり］

aozoraConvert attempts a correction when nested indentation is suspected, but the result is not always appropriate, for example when the nesting is complex.
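The rewrite in the example above can be sketched as a single pass over the lines. This is a deliberately simplified illustration, not the tool's algorithm: flattenNestedIndent and its helpers are hypothetical, and the sketch assumes that every single-line indent inside an indent block is relative to that block, whereas the README does not say how the tool decides that nesting is "suspected".

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
	"strings"
)

var (
	blockStart = regexp.MustCompile(`^［＃ここから([０-９]+)字下げ］$`)
	lineIndent = regexp.MustCompile(`^［＃([０-９]+)字下げ］`)
)

const blockEnd = "［＃ここで字下げ終わり］"

// fromFullWidth converts full-width digits such as "１２" to an int.
func fromFullWidth(s string) int {
	n := 0
	for _, r := range s {
		n = n*10 + int(r-'０')
	}
	return n
}

// toFullWidth converts an int to full-width digits.
func toFullWidth(n int) string {
	var b strings.Builder
	for _, r := range strconv.Itoa(n) {
		b.WriteRune('０' + (r - '0'))
	}
	return b.String()
}

// flattenNestedIndent adds the enclosing block's indent to single-line
// indents found inside the block (assumption: they are always relative).
func flattenNestedIndent(lines []string) []string {
	out := make([]string, 0, len(lines))
	block := 0 // indent of the enclosing block; 0 outside any block
	for _, l := range lines {
		switch m := blockStart.FindStringSubmatch(l); {
		case m != nil:
			block = fromFullWidth(m[1])
		case l == blockEnd:
			block = 0
		case block > 0:
			if loc := lineIndent.FindStringSubmatchIndex(l); loc != nil {
				n := fromFullWidth(l[loc[2]:loc[3]])
				l = "［＃" + toFullWidth(block+n) + "字下げ］" + l[loc[1]:]
			}
		}
		out = append(out, l)
	}
	return out
}

func main() {
	in := []string{"［＃ここから２字下げ］", "・・・", "［＃１字下げ］ｘｘｘ", "・・・", blockEnd}
	for _, l := range flattenNestedIndent(in) {
		fmt.Println(l)
	}
}
```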

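When automatic corrections are made, the lines that needed correction and the corrections applied are shown on screen.

To disable automatic correction, specify the -strict option. For example:

aozoraConvert -strict -o output.txt input-file

This works with every output format.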

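About aozora.orihasam.com

aozora.orihasam.com is a web app built on the Go library underlying the aozoraConvert tool described above. It makes searching and converting Aozora Bunko texts easy.

Each book's page offers several download options, which are explained here. The menu at the top right of each page lets you limit gaiji replacement to the JIS0213 range. The names of downloaded files are derived from the name of the txt file provided by Aozora Bunko.

One advantage is that titles and similar metadata are corrected automatically using the Aozora Bunko database, so you do not have to check and fix them yourself.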

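Epub. The download is equivalent to the output of

aozoraConvert -o output.epub input.zip

If conversion within JIS0213 is enabled, the -jis0213 option is added as well.

Azw3. The download is equivalent to the output of

aozoraConvert -o output.azw3 input.zip

LaTeX. The download is a zip archive of the output of

aozoraConvert -supportFiles -full -o output.tex input.zip

The latex-samples folder contains several compiled examples of LaTeX files created with aozora.orihasam.com.

HTML. The download is a zip archive of the output of

aozoraConvert -supportFiles -full -o output.html input.zip

JSON. The download is a zip archive of the output of

aozoraConvert -supportFiles -o output.json input.zip

Text file. The download is a zip archive of the output of

aozoraConvert -supportFiles -o output.txt input.zip

The original Aozora Bunko text file. This is normally identical to the text file available from Aozora Bunko, but some texts have problems that the automatic correction described above cannot handle. In those cases the zip package contains a manually corrected text file. The original text has _original appended to its extension. A diff file showing the differences between the corrected version and the original is also included in the package, with the extension .diff.
The list of manually corrected files can be found in manual_corrections.csv, in the same folder as the aozoraConvert source code. It is extracted from the database file provided by Aozora Bunko and contains only the corrected entries.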

Documentation ¶

Overview ¶

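A library and command-line tool that extracts an abstract syntax tree (AST) from files written with Aozora Bunko annotations and serializes it into various formats. Output currently supports LaTeX, HTML (plus Epub and AZW3, which are built on it), Aozora Bunko-style text files, and JSON. Because everything goes through the AST, adding other serialization formats is relatively easy.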

Index ¶

Variables
func ConvertMKT(mkt string) (s string, err error)
func ListProcessedTokens(text string) string
func ListRawTokens(text string) string
func RenderAozoraText(ast *Node, w *strings.Builder) (err error)
func RenderHTML(ast *Node, w *strings.Builder) (err error)
func RenderHTMLFull(ast *Node, w *strings.Builder) (err error)
func RenderJSON(ast *Node, w *strings.Builder) (err error)
func RenderLaTeX(ast *Node, w *strings.Builder) (err error)
func RenderLaTeXFull(ast *Node, w *strings.Builder) (err error)
func RenderNavHTML(ast *Node, w *strings.Builder) (err error)
func Serialize(n *Node, w *strings.Builder, ...)
func SerializeDescendants(n *Node, w *strings.Builder, ...)
func SetCorrectionLog(out io.Writer)
func SetFragment(v bool)
func SetFullUnicode()
func SetJIS0208()
func SetJIS0213()
func SetMessageLog(out io.Writer)
func SetParsable(v bool)
func SetRubyEmph(v bool)
func SetStrict(v bool)
func SetVerbose()
func ToHiragana(r rune) rune
func ToSJIS(text string) string
func ToUTF8(data []byte) string
type Book
- func NewBook() *Book
- func NewEbookFromZip(dz []byte) (bk *Book)
- func (b *Book) EmbedImages()
- func (b *Book) GetURI() string
- func (b *Book) RenderAZW3() []byte
- func (b *Book) RenderEpub() []byte
- func (b *Book) RenderMonolithicHTML() []byte
- func (b *Book) RenderPackage(format string) []byte
- func (b *Book) SetCreator(c string)
- func (b *Book) SetMetadataFromText()
- func (b *Book) SetPublisher(p string)
- func (b *Book) SetTitle(t string)
- func (b *Book) SetURI(l string)
type CharTypeID
- func CharType(r rune) CharTypeID
- func (i CharTypeID) String() string
type Node
- func AST(data string) (n *Node, err error)
- func (n *Node) AddContributor(name string)
- func (n *Node) Children() []*Node
- func (n *Node) ClearChildren()
- func (n *Node) ClearMetadata()
- func (n *Node) HasChild() bool
- func (n *Node) IsLastSibling() bool
- func (n *Node) NestingLevel() int
- func (n *Node) Parent() *Node
- func (n *Node) Remove()
- func (n *Node) SetAttr(key string, val string)
- func (n *Node) SetSubtitle(subtitle string)
- func (n *Node) SetTitle(title string)
- func (n *Node) Siblings() []*Node
- func (n *Node) String() string

Constants ¶

This section is empty.

Variables ¶

var AozoraCSS string

AozoraCSS is the default CSS to be used with HTML files rendered through aozoraconvert.

var LaTeXdefinitions string

LaTeXdefinitions contains the default definitions of commands and environments for LaTeX output.

Functions ¶

func ConvertMKT ¶

func ConvertMKT(mkt string) (s string, err error)

ConvertMKT returns the Unicode string corresponding to the JIS code point given in the 面区点 (men-ku-ten) format. mkt needs to be formatted as a string of the form "d-dd-dd".
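As an illustration of the "d-dd-dd" input format only, the sketch below validates and splits a men-ku-ten string. parseMKT is a hypothetical helper and does not perform the lookup to Unicode that ConvertMKT does; the range check (men 1 or 2, ku and ten 1 to 94) reflects the JIS X 0213 plane layout and is an assumption about what counts as valid input.

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
)

// mktPattern matches the documented "d-dd-dd" shape.
var mktPattern = regexp.MustCompile(`^(\d)-(\d\d)-(\d\d)$`)

// parseMKT is a hypothetical helper that splits a men-ku-ten string
// into its three numeric parts.
func parseMKT(mkt string) (men, ku, ten int, err error) {
	m := mktPattern.FindStringSubmatch(mkt)
	if m == nil {
		return 0, 0, 0, fmt.Errorf("%q is not of the form d-dd-dd", mkt)
	}
	// Atoi cannot fail here because the pattern admits digits only.
	men, _ = strconv.Atoi(m[1])
	ku, _ = strconv.Atoi(m[2])
	ten, _ = strconv.Atoi(m[3])
	if men < 1 || men > 2 || ku < 1 || ku > 94 || ten < 1 || ten > 94 {
		return 0, 0, 0, fmt.Errorf("%q is outside the JIS X 0213 range", mkt)
	}
	return men, ku, ten, nil
}

func main() {
	fmt.Println(parseMKT("1-85-25"))
	fmt.Println(parseMKT("1-8-25"))
}
```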

func ListProcessedTokens ¶

func ListProcessedTokens(text string) string

ListProcessedTokens lists the tokens after preparation for parsing.

func ListRawTokens ¶

func ListRawTokens(text string) string

ListRawTokens lists the tokens returned through the initial tokenization.

func RenderAozoraText ¶

func RenderAozoraText(ast *Node, w *strings.Builder) (err error)

RenderAozoraText renders ast as a string formatted in the style of Aozorabunko.

func RenderHTML ¶

func RenderHTML(ast *Node, w *strings.Builder) (err error)

RenderHTML renders ast as an html fragment.

func RenderHTMLFull ¶

func RenderHTMLFull(ast *Node, w *strings.Builder) (err error)

RenderHTMLFull renders ast as a full HTML file including doctype declaration and head element.

func RenderJSON ¶

func RenderJSON(ast *Node, w *strings.Builder) (err error)

RenderJSON renders ast in JSON format.

func RenderLaTeX ¶

func RenderLaTeX(ast *Node, w *strings.Builder) (err error)

RenderLaTeX renders ast as a LaTeX fragment.

func RenderLaTeXFull ¶

func RenderLaTeXFull(ast *Node, w *strings.Builder) (err error)

RenderLaTeXFull renders ast as a complete, compilable LaTeX document. You will need to use the uplatex engine.

func RenderNavHTML ¶

func RenderNavHTML(ast *Node, w *strings.Builder) (err error)

RenderNavHTML returns the table of contents for the text given by ast. TOC is formatted as an html ordered list.

func Serialize ¶

func Serialize(n *Node, w *strings.Builder, ingressFunc, egressFunc func(*Node, *strings.Builder))

Serialize AST given by n as a string. ingressFunc controls output when entering a node, egressFunc controls the output when exiting a node.

func SerializeDescendants ¶

func SerializeDescendants(n *Node, w *strings.Builder, ingressFunc, egressFunc func(*Node, *strings.Builder))

SerializeDescendants leaves out the top node n in serializing.

func SetCorrectionLog ¶

func SetCorrectionLog(out io.Writer)

SetCorrectionLog sets the output destination for messages about automatic corrections made while parsing in non-strict mode.

func SetFragment ¶

func SetFragment(v bool)

SetFragment controls whether input should be treated as a full aozorabunko text or just a fragment.

func SetFullUnicode ¶

func SetFullUnicode()

SetFullUnicode sets output to allow the full range of unicode codepoints.

func SetJIS0208 ¶

func SetJIS0208()

SetJIS0208 sets output to JIS0208.

func SetJIS0213 ¶

func SetJIS0213()

SetJIS0213 sets output to JIS0213.

func SetMessageLog ¶

func SetMessageLog(out io.Writer)

SetMessageLog sets the output destination for general messages from parser and renderer.

func SetParsable ¶

func SetParsable(v bool)

SetParsable sets output to a parsable form if v is true. Only useful for text output.

func SetRubyEmph ¶

func SetRubyEmph(v bool)

SetRubyEmph sets whether emphasis marks are handled as text-emphasis or as ruby (only relevant for (X)HTML output).

func SetStrict ¶

func SetStrict(v bool)

SetStrict tells the parser whether fixable errors in the input file should abort the parsing.

func SetVerbose ¶

func SetVerbose()

func ToHiragana ¶

func ToHiragana(r rune) rune

ToHiragana converts r to hiragana if and only if r is katakana.
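A conversion of this kind can be done with a fixed code point offset, because the katakana block mirrors the hiragana block 0x60 positions later in Unicode. The function below is a stand-in written for illustration, not the package's ToHiragana; which katakana the real function covers (for example half-width forms) is not stated in the documentation.

```go
package main

import "fmt"

// toHiragana is a hypothetical stand-in. It maps full-width katakana
// U+30A1 (ァ) through U+30F6 (ヶ) onto hiragana U+3041 (ぁ) through
// U+3096 (ゖ) and returns every other rune unchanged.
func toHiragana(r rune) rune {
	if r >= 0x30A1 && r <= 0x30F6 {
		return r - 0x60
	}
	return r
}

func main() {
	for _, r := range "アオゾラ文庫" {
		fmt.Print(string(toHiragana(r)))
	}
	fmt.Println()
}
```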

func ToSJIS ¶

func ToSJIS(text string) string

ToSJIS converts text to ShiftJIS encoding. Also converts to DOS line endings.

func ToUTF8 ¶

func ToUTF8(data []byte) string

ToUTF8 converts from ShiftJIS to UTF8. Also converts to unix line endings. Note that input is []byte, not string.
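The line-ending half of ToSJIS and ToUTF8 is easy to sketch with the standard library. The helpers below are hypothetical and cover line endings only; the character set conversion is omitted because it needs a package outside the standard library (golang.org/x/text/encoding/japanese provides a Shift_JIS codec, though whether aozoraconvert uses it is not stated here).

```go
package main

import (
	"fmt"
	"strings"
)

// toDOS converts line endings to CRLF. Existing CRLF pairs are
// normalized first so that they are not doubled.
func toDOS(s string) string {
	s = strings.ReplaceAll(s, "\r\n", "\n")
	return strings.ReplaceAll(s, "\n", "\r\n")
}

// toUnix converts CRLF line endings to LF.
func toUnix(s string) string {
	return strings.ReplaceAll(s, "\r\n", "\n")
}

func main() {
	fmt.Printf("%q\n", toDOS("一行目\n二行目\r\n"))
	fmt.Printf("%q\n", toUnix("一行目\r\n二行目\r\n"))
}
```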

Types ¶

type Book ¶

type Book struct {
	Title, Creator, Publisher string
	Files                     []fileData
	UUID                      string
	Body                      *Node
	URI                       string
	Images                    []records.ImageRecord
	CSS                       string
	Hash                      string
	DateMod                   string
	TxtFileName               string
}

Book represents a book from Aozora Bunko

func NewBook ¶

func NewBook() *Book

NewBook returns a new Book.

func NewEbookFromZip ¶

func NewEbookFromZip(dz []byte) (bk *Book)

NewEbookFromZip returns Book from dz which must be zip archive containing the Aozorabunko text and any needed graphics files.
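Before handing an archive to NewEbookFromZip it can be useful to check what it contains. The sketch below uses only archive/zip from the standard library; listZip and makeZip are hypothetical helpers, and the set of image extensions is an assumption rather than something the package documents.

```go
package main

import (
	"archive/zip"
	"bytes"
	"fmt"
	"path"
	"strings"
)

// listZip returns the names of text files and image files found in the
// zip archive dz.
func listZip(dz []byte) (texts, images []string, err error) {
	r, err := zip.NewReader(bytes.NewReader(dz), int64(len(dz)))
	if err != nil {
		return nil, nil, err
	}
	for _, f := range r.File {
		switch strings.ToLower(path.Ext(f.Name)) {
		case ".txt":
			texts = append(texts, f.Name)
		case ".png", ".jpg", ".jpeg": // assumed image extensions
			images = append(images, f.Name)
		}
	}
	return texts, images, nil
}

// makeZip builds a small in-memory archive so the sketch is runnable
// without any files on disk.
func makeZip(names ...string) []byte {
	var buf bytes.Buffer
	w := zip.NewWriter(&buf)
	for _, n := range names {
		f, err := w.Create(n)
		if err != nil {
			panic(err)
		}
		if _, err := f.Write([]byte("placeholder")); err != nil {
			panic(err)
		}
	}
	if err := w.Close(); err != nil {
		panic(err)
	}
	return buf.Bytes()
}

func main() {
	texts, images, err := listZip(makeZip("book.txt", "fig1.png", "fig2.PNG"))
	if err != nil {
		fmt.Println(err)
		return
	}
	fmt.Println(texts, images)
}
```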

func (*Book) EmbedImages ¶

func (b *Book) EmbedImages()

EmbedImages embeds the image data in the relevant nodes in the AST of b.

func (*Book) GetURI ¶

func (b *Book) GetURI() string

GetURI returns the path of the book within Aozora Bunko's file structure.

func (*Book) RenderAZW3 ¶

func (b *Book) RenderAZW3() []byte

RenderAZW3 returns b as an AZW3 file.

func (*Book) RenderEpub ¶

func (b *Book) RenderEpub() []byte

RenderEpub returns b as a zipped Epub file.

func (*Book) RenderMonolithicHTML ¶

func (b *Book) RenderMonolithicHTML() []byte

RenderMonolithicHTML returns the book with the images embedded into the HTML.

func (*Book) RenderPackage ¶

func (b *Book) RenderPackage(format string) []byte

RenderPackage returns a zip package for the given format

func (*Book) SetCreator ¶

func (b *Book) SetCreator(c string)

SetCreator sets the creator to c.

func (*Book) SetMetadataFromText ¶

func (b *Book) SetMetadataFromText()

SetMetadataFromText sets metadata from text.

func (*Book) SetPublisher ¶

func (b *Book) SetPublisher(p string)

SetPublisher sets the publisher to p.

func (*Book) SetTitle ¶

func (b *Book) SetTitle(t string)

SetTitle sets the title to t.

func (*Book) SetURI ¶

func (b *Book) SetURI(l string)

SetURI sets the path of book within Aozora Bunko's file structure.

type CharTypeID ¶

type CharTypeID int

CharTypeID represents character types.

const (
	Symbol CharTypeID = 1 << iota //Symbol captures everything that isn't captured by the other categories.
	Hiragana
	Katakana
	Kanji
	Whitespace
	Punctuation
	Roman
)

Character types.
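Because the constants are declared with 1 << iota, each character type occupies its own bit. The sketch below redeclares the type locally (so that it runs without importing the package) and prints the resulting values. Combining types into a mask, as shown with kana, is an assumption about how the values could be used, not documented behavior of CharType.

```go
package main

import "fmt"

// CharTypeID and the constants are copied from the declaration above
// so that this example is self-contained.
type CharTypeID int

const (
	Symbol CharTypeID = 1 << iota
	Hiragana
	Katakana
	Kanji
	Whitespace
	Punctuation
	Roman
)

func main() {
	// Each constant is a distinct power of two.
	fmt.Println(int(Symbol), int(Hiragana), int(Katakana), int(Kanji),
		int(Whitespace), int(Punctuation), int(Roman))

	// Hypothetical use: a mask that matches either kind of kana.
	kana := Hiragana | Katakana
	fmt.Println(Katakana&kana != 0, Kanji&kana != 0)
}
```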

func CharType ¶

func CharType(r rune) CharTypeID

CharType returns the character type of r

func (CharTypeID) String ¶

func (i CharTypeID) String() string

type Node ¶

type Node struct {
	Attr map[string]string
	// contains filtered or unexported fields
}

Node represents a node in an AST.

func AST ¶

func AST(data string) (n *Node, err error)

AST returns the root node of the AST for data. data should be a properly formatted Aozorabunko text. If not, it will probably panic.

func (*Node) AddContributor ¶

func (n *Node) AddContributor(name string)

AddContributor adds a contributor

func (*Node) Children ¶

func (n *Node) Children() []*Node

Children returns the children of n in order as a slice, starting with the first child of n.

func (*Node) ClearChildren ¶

func (n *Node) ClearChildren()

ClearChildren removes all children of n

func (*Node) ClearMetadata ¶

func (n *Node) ClearMetadata()

ClearMetadata clears the metadata node of n.

func (*Node) HasChild ¶

func (n *Node) HasChild() bool

HasChild returns whether n has a child node

func (*Node) IsLastSibling ¶

func (n *Node) IsLastSibling() bool

IsLastSibling returns whether or not n has any further siblings.

func (*Node) NestingLevel ¶

func (n *Node) NestingLevel() int

NestingLevel returns the nesting level of n

func (*Node) Parent ¶

func (n *Node) Parent() *Node

Parent returns the parent node of n. nil if n is top node.

func (*Node) Remove ¶

func (n *Node) Remove()

Remove removes the node. After remove, n will have no siblings, and no parent.

func (*Node) SetAttr ¶

func (n *Node) SetAttr(key string, val string)

SetAttr sets an Attr of n with the key-val pair.

func (*Node) SetSubtitle ¶

func (n *Node) SetSubtitle(subtitle string)

SetSubtitle sets the subtitle.

func (*Node) SetTitle ¶

func (n *Node) SetTitle(title string)

SetTitle sets the title of n to title. n must have a child node with Attr["type"]=="metadata"

func (*Node) Siblings ¶

func (n *Node) Siblings() []*Node

Siblings returns a slice of all siblings including and after n.

func (*Node) String ¶

func (n *Node) String() string

String returns the attributes of n.

Directories ¶

Path	Synopsis
cmd
aozoraConvert	command
mobi	Package mobi implements writing KF8-style formatted MOBI and AZW3 books.
jfif	Package jfif implements writing JPEG images with fixed JFIF header.
pdb	Package pdb implements reading and writing PalmDB databases.
records	Package records contains facilities to create MOBI formatted books.
types	Package types contains types and constants to create MOBI formatted books.
