How to migrate Tweakers Tweakblog content to Hugo

11 Dec, 2022 (Updated: 11 Feb, 2023)

Tweakers.net (tweakers.net) unfortunately decided to sunset their tweakblogs (tweakers.net). Here I document how I ported my tweakblogs content (tweakblogs.net) to Hugo (gohugo.io).

Mirror complete blog

While the export is useful, it doesn’t include images. Here are two unsuccessful attempts at a mirror command that also fetches the (linked) images:

# This queries too many domains; I'd prefer it to stay on tweakers.net and tweakblogs.net but couldn't find that option
httrack --ext-depth=1 "https://atomstar.tweakblogs.net/blog/19440/particulates-kill-build-your-sensor-now"

# This queries too many pages
wget --domains tweakers.net,tweakblogs.net -E -H -k -K -p --recursive "https://atomstar.tweakblogs.net/blog/19440/particulates-kill-build-your-sensor-now"

Tweakblogs are down now, so this is no longer useful.

Inspect export content

First check the whole file:

jq . tweakblog-export.json

Now let’s inspect the top-level keys:

jq -r '. | keys[]' tweakblog-export.json

beschrijving
beschrijvinghtml
blognaam
categorieën
customCss
ingangsdatum
laatstewijzigingsdatum
ondertitel
posts
titel

The posts are stored in .posts. Each post has a tekst key with the UBB-formatted source and a teksthtml key with the same content rendered as HTML.

jq -r '.posts[0] | keys[]' tweakblog-export.json

categorieën
laatstewijzigingsdatum
publicatiedatum
tekst
teksthtml
titel
url
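The same inspection can be done from Python. This is a minimal sketch, assuming the export has the structure shown above; the inline `sample` document and the `describe` helper are made up for illustration and stand in for tweakblog-export.json.

```python
import json

# Hypothetical miniature export with the same shape as tweakblog-export.json
sample = json.loads("""
{
  "titel": "Example blog",
  "posts": [
    {"titel": "First post",
     "url": "https://example.tweakblogs.net/blog/1/first-post",
     "tekst": "[b]hello[/b]",
     "teksthtml": "<b>hello</b>"}
  ]
}
""")

def describe(export):
    """Return the sorted top-level keys and the sorted keys of the first post."""
    return sorted(export), sorted(export["posts"][0])

top_keys, post_keys = describe(sample)
print(top_keys)
print(post_keys)
```

For the real file, replace the inline document with `json.load(open("tweakblog-export.json", encoding="utf-8"))`.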

Get post count:

numposts=$(jq -r '.posts | length' tweakblog-export.json)
echo $numposts
34

Get the titles of all posts. jq arrays are zero-indexed, so the index runs from 0 to numposts - 1:

numposts=$(jq -r '.posts | length' tweakblog-export.json)

for i in $(seq 0 $((numposts - 1))); do
	echo -n "$i "; jq -r '.posts['$i'].titel' tweakblog-export.json
done
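For comparison, here is a small Python sketch of the same listing. Indices are zero-based, just like jq's `.posts[i]`, so the first post is index 0 and the last is the post count minus one. The `export` dict and `list_titles` helper are hypothetical stand-ins for the real export.

```python
def list_titles(export):
    """Return (index, title) pairs; indices are zero-based, like jq's .posts[i]."""
    return [(i, post["titel"]) for i, post in enumerate(export["posts"])]

# Hypothetical data standing in for the parsed tweakblog-export.json
export = {"posts": [{"titel": "First post"}, {"titel": "Second post"}]}

for index, title in list_titles(export):
    print(index, title)
```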

jq doesn’t like escaped keys (like categorie\u00ebn), so we replace that key with an ASCII version using sed before processing.

sed 's/\"categorie\\\u00ebn\"/\"categorieen\"/g' tweakblog-export.json > tweakblog-export-ascii.json
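If you process the export in Python instead, the rename may not be needed: the `json` module decodes the `\u00eb` escape while parsing, so the key can be looked up directly. A minimal sketch, where the category values and the `get_categories` helper are assumptions for illustration:

```python
import json

# Raw JSON text using the escaped form of the key, as in the export
raw = '{"posts": [{"categorie\\u00ebn": ["hardware", "diy"]}]}'
export = json.loads(raw)

def get_categories(post):
    """Return the post's categories, or an empty list if the key is absent."""
    # "\u00eb" is the Python escape for the same character (e with diaeresis)
    return post.get("categorie\u00ebn", [])

print(get_categories(export["posts"][0]))
```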

Convert to Markdown

Now let’s convert a post to Markdown. Initially I thought UBB might be easier, but there are more HTML-to-Markdown converters, so I settled on converting the HTML.

Take one post as an example, starting by populating the header:

# Get filename from URL
filename=$(jq -r '.posts[0].url | split("/")[-1]' tweakblog-export.json).md

# Alternatively urlencode ourselves through Python:
# jq -r '.posts[0].titel' tweakblog-export.json | python3 -c "import urllib.parse; print(urllib.parse.quote_plus(input().lower()))"

echo "---" > ${filename}
jq -r '.posts[0] | "title: \"\(.titel)\""' tweakblog-export.json >> ${filename}
# Get publish date and lastmod date, reformat to ISO time with gsub().
jq -r '.posts[0] | "date: \(.publicatiedatum | gsub(" ";"T"))+01:00"' tweakblog-export.json >> ${filename}
# Test whether the laatstewijzigingsdatum key exists (and is not null) using jq's -e flag before writing to the file
jq -er '.posts[0].laatstewijzigingsdatum' tweakblog-export-ascii.json > /dev/null && jq -r '.posts[0] | "lastmod: \(.laatstewijzigingsdatum | gsub(" ";"T"))+01:00"' tweakblog-export.json >> ${filename}
# Use the ASCII version of the export instead of the UTF one because I can't decode the key 'categorieën' (?)
jq -er '.posts[0].categorieen' tweakblog-export-ascii.json > /dev/null && jq -r '.posts[0] | "tags: \(.categorieen)"' tweakblog-export-ascii.json >> ${filename}
echo "---" >> ${filename}
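The header logic above can also be sketched in Python. This is an illustration rather than the script I used: `front_matter` is a hypothetical helper, the timestamps are assumed to look like `2022-12-11 10:00:00`, and the fixed `+01:00` offset is copied from the jq commands.

```python
import json

def front_matter(post, utc_offset="+01:00"):
    """Build a YAML front matter block from one exported post (a dict)."""
    # json.dumps gives a double-quoted string with embedded quotes escaped
    lines = ["---", "title: " + json.dumps(post["titel"], ensure_ascii=False)]
    lines.append("date: " + post["publicatiedatum"].replace(" ", "T") + utc_offset)
    # Optional keys are only written when present and non-empty
    if post.get("laatstewijzigingsdatum"):
        lines.append("lastmod: " + post["laatstewijzigingsdatum"].replace(" ", "T") + utc_offset)
    categories = post.get("categorie\u00ebn")
    if categories:
        lines.append("tags: " + json.dumps(categories, ensure_ascii=False))
    lines.append("---")
    return "\n".join(lines) + "\n"

# Hypothetical post for illustration
print(front_matter({"titel": "Hello", "publicatiedatum": "2022-12-11 10:00:00"}))
```

Using `json.dumps` for the title means a title containing double quotes still produces a valid quoted string.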

Now convert the body of the same post from HTML to Markdown, post-processing some tags along the way:

jq -r '.posts[0].teksthtml' tweakblog-export.json | sed -E 's|<(/)?h4|<\1h1|g' | sed -E 's|<(/)?h5|<\1h2|g' | sed -E 's|<(/)?h6|<\1h3|g' | sed -E 's|\[more\]|<!-- more -->|g' | html2text --no-wrap-links --body-width 0 --mark-code | gawk -v RS= '{ gsub(/\[code\][0-9 \n]+\[\/code\]/, ""); print}' | sed -E 's|\[.?code\]|```|g'  >> ${filename}

Post-processing applied:

Convert title headers: h4 --> h1, h5 --> h2, h6 --> h3
- echo "<h4> hello h4</h4>" | sed -E 's|<(/)?h4|<\1h1|g'
Fix [more] --> <!-- more -->
- echo '[more]' | sed -E 's|\[more\]|<!-- more -->|g'
Remove [code] [/code] blocks that contain only numbers (these were line numbers)
- echo '[code] 1 2 3 4 [/code]' | gawk -v RS= '{ gsub(/\[code\][0-9 \n]+\[\/code\]/, "removed"); print}'
Convert [code] and [/code] tags --> ```
- echo '[code] echo "hello world" [/code]' | sed -E 's|\[.?code\]|```|g'
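The same post-processing steps can be expressed with Python's `re` module, which makes them easier to test in isolation. A rough sketch, not the pipeline I actually ran: the function names are made up, and the code-tag pattern is slightly stricter than the sed version (`/?` instead of `.?`).

```python
import re

FENCE = "`" * 3  # Markdown code fence, built this way to keep this example readable

def preprocess_html(html):
    """Promote h4/h5/h6 to h1/h2/h3 and turn [more] into an HTML comment marker."""
    for old, new in (("4", "1"), ("5", "2"), ("6", "3")):
        # Matches both opening (<h4) and closing (</h4) tags
        html = re.sub(r"<(/?)h" + old, r"<\1h" + new, html)
    return html.replace("[more]", "<!-- more -->")

def postprocess_markdown(md):
    """Drop [code] blocks holding only line numbers, then fence the remaining ones."""
    md = re.sub(r"\[code\][0-9 \n]+\[/code\]", "", md)
    return re.sub(r"\[/?code\]", FENCE, md)

print(preprocess_html("<h4> hello h4</h4>"))
print(postprocess_markdown('[code] echo "hello world" [/code]'))
```

`preprocess_html` runs before the HTML-to-Markdown conversion and `postprocess_markdown` after it, mirroring the position of the sed and gawk stages around html2text.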

Get images in Markdown

This is a bit trickier, and it also depends on your Markdown/Hugo setup.

For example, this is the raw Markdown output:

[![BigSensorThing mk1 final assembly left](https://tweakers.net/i/MQSfbu_nx7hoaM49pcvQFhdAvA4=/234x176/filters:strip_icc\(\):strip_exif\(\)/f/image/sAzk9VFyAVlUCMnYpccATE3s.jpg?f=fotoalbum_medium)](https://tweakers.net/i/rl6WaePdlWDXlp7Ffwyh9eQfa5Y=/full-fit-in/4920x3264/filters:max_bytes\(3145728\):no_upscale\(\):strip_icc\(\):fill\(white\):strip_exif\(\)/f/image/sAzk9VFyAVlUCMnYpccATE3s.jpg?f=user_large)

[![RJ45 wall socket with T568A/B color coding](https://tweakers.net/ext/f/u7suFCCv9UE5j7SKkYSLVKsM/medium.jpg)](https://tweakers.net/ext/f/u7suFCCv9UE5j7SKkYSLVKsM/full.jpg)

Or as HTML:

<img src='https://tweakers.net/i/Ph9XggAWe4za5KHgdaG_m9hQnGs=/234x176/filters:strip_icc():strip_exif()/f/image/p4to7QsT4oLYAqL9bHcUblWI.jpg?f=fotoalbum_medium' width='234' height='176' alt='BigSensorThing mk1 final assembly front' />[](https://tweakers.net/i/GjLa8pGnWlTDxQUfeA-LlpCKAuU=/full-fit-in/4920x3264/filters:max_bytes\(3145728\):no_upscale\(\):strip_icc\(\):fill\(white\):strip_exif\(\)/f/image/p4to7QsT4oLYAqL9bHcUblWI.jpg?f=user_large)<img src='https://tweakers.net/i/MQSfbu_nx7hoaM49pcvQFhdAvA4=/234x176/filters:strip_icc():strip_exif()/f/image/sAzk9VFyAVlUCMnYpccATE3s.jpg?f=fotoalbum_medium' width='234' height='176' alt='BigSensorThing mk1 final assembly left' />[](https://tweakers.net/i/rl6WaePdlWDXlp7Ffwyh9eQfa5Y=/full-fit-in/4920x3264/filters:max_bytes\(3145728\):no_upscale\(\):strip_icc\(\):fill\(white\):strip_exif\(\)/f/image/sAzk9VFyAVlUCMnYpccATE3s.jpg?f=user_large)

In my Hugo setup I use constructs like the one below to tile images and serve responsive sizes.

{{< rawhtml >}}
<div class="row">
  <div class="col3">
    {{< fig src="images/diy-jaga-dbe-dbh/IMG_4664-web.jpg" sizes="(min-width: 720px) 240px, (min-width: 480px) 360px, 100vw" caption="Marking out fan holes on aluminium frame" >}}
  </div>
  <div class="col3">
    {{< fig src="images/diy-jaga-dbe-dbh/IMG_4670-web.jpg" sizes="(min-width: 720px) 240px, (min-width: 480px) 360px, 100vw" caption="Drilling out fan holes & sawing out fan blade area for improved airflow" >}}
  </div>
  <div class="col3">
    {{< fig src="images/diy-jaga-dbe-dbh/IMG_4673-web.jpg" sizes="(min-width: 720px) 240px, (min-width: 480px) 360px, 100vw" caption="End result of two aluminium frames with sawed out fan blade area" >}}
  </div>
</div>
{{< /rawhtml >}}
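If you need many of these rows, the markup can be generated. A small sketch, assuming the `rawhtml` and `fig` shortcodes and the `row`/`col3` classes from my setup above; `fig_shortcode` and `tile_row` are hypothetical helper names, and captions are assumed not to contain double quotes.

```python
SIZES = "(min-width: 720px) 240px, (min-width: 480px) 360px, 100vw"

def fig_shortcode(src, caption):
    """Render one fig shortcode; assumes src and caption contain no double quotes."""
    return '{{< fig src="%s" sizes="%s" caption="%s" >}}' % (src, SIZES, caption)

def tile_row(figures):
    """Render a row of image tiles from a list of (src, caption) pairs."""
    cells = []
    for src, caption in figures:
        cells.append('  <div class="col3">\n    %s\n  </div>' % fig_shortcode(src, caption))
    lines = ['{{< rawhtml >}}', '<div class="row">'] + cells + ['</div>', '{{< /rawhtml >}}']
    return "\n".join(lines)

print(tile_row([
    ("images/diy-jaga-dbe-dbh/IMG_4664-web.jpg", "Marking out fan holes on aluminium frame"),
]))
```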

Because this was getting a bit too complex for pipes alone, I used the Python script below to download the images and update the links.

cat << 'EOF' > ./parse_images.py
import requests
import re
import hashlib
import os

from argparse import ArgumentParser

HUGODIR="/path/to/hugo/"

parser = ArgumentParser()
parser.add_argument('filename')
args = parser.parse_args()

# "/path/to/hugo/content/blog/post-name.md"
postpath = args.filename
postimgdir = os.path.splitext(os.path.basename(postpath))[0]


with open(postpath) as fd:
	postdata = fd.read()

# Regexp to find images, which look like [![text](imgurl)](linkurl)
# https://stackoverflow.com/questions/8435368/python-regex-ignore-escaped-character
# \[\!\[(.+?)\] --> match [![alt]
# \((.+?)\)\]\( --> match (imgurl)](
# (([^\)\\]|\\\(|\\\))+?)\) --> match linkurl: any character except ')' or '\', OR an escaped '\(' OR '\)', up to the closing ')'

# Next: get URL, download as file, update links in text, and insert in text
# {{< fig src="images/particulates-kill-build-your-sensor-now/1234038.jpg" sizes="(min-width: 720px) 240px, (min-width: 480px) 360px, 100vw" alt="Finished BigSensorthing viewed from the front" >}}

# URLS look like: 
# - https://tweakers.net/ext/f/ATxD1d1Zjm7tRJhONsaSxbCg/full.jpg
# - https://tweakers.net/fotoalbum/image/5gf3YiISLUx4ryKYz6qePHay.png
# - https://tweakers.net/i/y8VQJY-LPiejkhRx1CuvKdq6qD0=/full-fit-in/4920x3264/filters:max_bytes(3145728):no_upscale():strip_icc():fill(white):strip_exif()/f/image/jpMEcpVLvo5lhd4x0sdBEbpP.jpg?f=user_large


# First find all image tags and associated hyperlinks, remove escape quotes, download these as the hex digest of the url (to get manageable filenames), and store on disk
imgmatch = re.findall(r'\[\!\[(.+?)\]\((.+?)\)\]\((([^\)\\]|\\\(|\\\))+?)\)', postdata)

# Now convert markdown images with Hugo images
# /*
postdata_new = re.sub(r'\[\!\[(.+?)\]\((.+?)\)\]\((([^\)\\]|\\\(|\\\))+?)\)', "{{< fig src=\"\\3\" sizes=\"(min-width: 720px) 240px, (min-width: 480px) 360px, 100vw\" >}}", postdata)
# */

# Now loop over the matches in the original data, download files, and update the hyperlink in the new data
for match in imgmatch:
	# Unpack match, remove escapes from linkurl, request data
	alt, imgurl, linkurl, __ = match
	# Remove backslash escapes from the link, and only take the part of the string until the first space (sometimes happens in borked input?)
	linkurl_clean = linkurl.replace("\\","").split(' ')[0]
	reqdata = requests.get(linkurl_clean, timeout=30)

	# Make filename, depending on image type
	content_type = reqdata.headers.get('Content-Type', '')
	if content_type == "image/png":
		filename = hashlib.md5(linkurl_clean.encode()).hexdigest()[-8:]+".png"
	elif content_type == "image/jpeg":
		filename = hashlib.md5(linkurl_clean.encode()).hexdigest()[-8:]+".jpg"
	else:
		# There is no usable filename for this response, so report it and skip to the next image
		print("Unknown image type, url {} has type {}. Probably malformed url, trying to continue".format(linkurl_clean, content_type))
		continue

	# print("Processing url {} to img {}".format(linkurl_clean, filename))

	filedir = os.path.join(HUGODIR, "images", postimgdir)
	# Create the image directory (and any missing parents); it is fine if it already exists
	os.makedirs(filedir, exist_ok=True)
	filepath = os.path.join(filedir, filename)

	img_data = reqdata.content
	with open(filepath, 'wb') as fd:
		fd.write(img_data)

	# Now replace the filename of the images with the digest filename
	imgpath = os.path.join("images",postimgdir,filename)
	postdata_new = postdata_new.replace(linkurl,imgpath)

# Update file
with open(postpath, "w") as fd:
	fd.write(postdata_new)
EOF

Test it:

python parse_images.py "${filename}"
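Before running the script on real posts, it can help to check the image regex on its own, without any downloads. The sketch below uses the same pattern as parse_images.py; `find_images` is a hypothetical wrapper, and the sample line is the RJ45 example shown earlier.

```python
import re

# Same pattern as in parse_images.py: [![alt](imgurl)](linkurl), where linkurl
# may contain backslash-escaped parentheses.
IMG_LINK = re.compile(r'\[\!\[(.+?)\]\((.+?)\)\]\((([^\)\\]|\\\(|\\\))+?)\)')

def find_images(markdown):
    """Return (alt, thumbnail_url, full_url) tuples, with escapes removed from full_url."""
    return [(alt, img, link.replace("\\", ""))
            for alt, img, link, _ in IMG_LINK.findall(markdown)]

sample = ("[![RJ45 wall socket with T568A/B color coding]"
          "(https://tweakers.net/ext/f/u7suFCCv9UE5j7SKkYSLVKsM/medium.jpg)]"
          "(https://tweakers.net/ext/f/u7suFCCv9UE5j7SKkYSLVKsM/full.jpg)")

for alt, thumb, full in find_images(sample):
    print(alt, "->", full)
```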

Putting it together

Now that we’ve tested one post we can bring it together and loop over all posts (almost).

postid=33
filename=$(jq -r '.posts['${postid}'].url | split("/")[-1]' tweakblog-export.json).md
filepath="/path/to/hugo/content/blog/"${filename}
echo "Parsing ${postid} -- ${filename}"

Make post header

echo "---" > ${filepath}
jq -r '.posts['${postid}'] | "title: \"\(.titel)\""' tweakblog-export.json >> ${filepath}
# Get publish date and lastmod date, reformat to ISO time with gsub().
jq -r '.posts['${postid}'] | "date: \(.publicatiedatum | gsub(" ";"T"))+01:00"' tweakblog-export.json >> ${filepath}
# Test whether the laatstewijzigingsdatum key exists (and is not null) using jq's -e flag before writing to the file
jq -er '.posts['${postid}'].laatstewijzigingsdatum' tweakblog-export-ascii.json > /dev/null && jq -r '.posts['${postid}'] | "lastmod: \(.laatstewijzigingsdatum | gsub(" ";"T"))+01:00"' tweakblog-export.json >> ${filepath}
# Use the ASCII version of the export instead of the UTF one because I can't decode the key 'categorieën' (?)
jq -er '.posts['${postid}'].categorieen' tweakblog-export-ascii.json > /dev/null && jq -r '.posts['${postid}'] | "tags: \(.categorieen)"' tweakblog-export-ascii.json >> ${filepath}
echo "---" >> ${filepath}

Generate post content

# Prep headers: h4 --> h1, h5 --> h2, h6 --> h3
#   echo "<h4> hello h4</h4>" | sed -E 's|<(/)?h4|<\1h1|g'
# Fix [more] --> <!-- more -->
#   echo '[more] --> <!-- more -->' | sed -E 's|\[more\]|<!-- more -->|g'
# Remove [code] [/code] with only numerical between
#   echo '[code] 1 2 3 4 [/code]' | gawk -v RS= '{ gsub(/\[code\][0-9 \n]+\[\/code\]/, "removed"); print}'
# Convert multi-line and single-line code tags: [code] and [/code] --> ```
#   echo '[code] echo "hello world" [/code]' | sed -E 's|\[.?code\]|```|g'
# https://askubuntu.com/questions/650114/filter-out-html-tag-and-replace-with-other-html-tags-using-sed

# Convert all html to markdown, post-process as guided above
jq -r '.posts['${postid}'].teksthtml' tweakblog-export.json | sed -E 's|<(/)?h4|<\1h1|g' | sed -E 's|<(/)?h5|<\1h2|g' | sed -E 's|<(/)?h6|<\1h3|g' | sed -E 's|\[more\]|<!-- more -->|g' | html2text --no-wrap-links --body-width 0 --mark-code | gawk -v RS= '{ gsub(/\[code\][0-9 \n]+\[\/code\]/, ""); print}' | sed -E 's|\[.?code\]|```|g'  >> ${filepath}

Now glue together and run in a loop. Whoosh!

numposts=$(jq -r '.posts | length' tweakblog-export.json)

for postid in $(seq 0 $((numposts - 1))); do
	# filename=$(jq -r '.posts['${postid}'].url | split("/")[-1]' tweakblog-export.json).md

	# Basic URL encoding: replace all non-alphanumeric by uncomplicated dashes: "${filename//[^[:alnum:]]/-}" or with gsed
	# If I use urlencoded strings as filenames and directory names (e.g. including %29 %28), this somehow leads to 404.
	filename=$(jq -r '.posts['${postid}'].titel | ascii_downcase' tweakblog-export.json | gsed 's/[^a-zA-Z0-9]\+/-/g').md
	filepath="/path/to/hugo/content/blog/"${filename}
	echo "Parsing ${postid} -- ${filename}"

	echo "---" > ${filepath}
	jq -r '.posts['${postid}'] | "title: \"\(.titel)\""' tweakblog-export.json >> ${filepath}
	# Get publish date and lastmod date, reformat to ISO time with gsub().
	jq -r '.posts['${postid}'] | "date: \(.publicatiedatum | gsub(" ";"T"))+01:00"' tweakblog-export.json >> ${filepath}
	# We test if the laatstewijzigingsdatum key exists using the -e flag of jq before writing to the export file 
	jq -er '.posts['${postid}'].laatstewijzigingsdatum' tweakblog-export-ascii.json >>/dev/null && jq -r '.posts['${postid}'] | "lastmod: \(.laatstewijzigingsdatum | gsub(" ";"T"))+01:00"' tweakblog-export.json >> ${filepath}
	# use ASCII export version instead of UTF because I can't decode the key 'categorieën' (?)
	jq -er '.posts['${postid}'].categorieen' tweakblog-export-ascii.json >>/dev/null && jq -r '.posts['${postid}'] | "tags: \(.categorieen)"' tweakblog-export-ascii.json >> ${filepath}
	echo "---" >> ${filepath}

	jq -r '.posts['${postid}'].teksthtml' tweakblog-export.json | sed -E 's|<(/)?h4|<\1h1|g' | sed -E 's|<(/)?h5|<\1h2|g' | sed -E 's|<(/)?h6|<\1h3|g' | sed -E 's|\[more\]|<!-- more -->|g' | html2text --no-wrap-links --body-width 0 --mark-code | gawk -v RS= '{ gsub(/\[code\][0-9 \n]+\[\/code\]/, ""); print}' | sed -E 's|\[.?code\]|```|g'  >> ${filepath}

	python parse_images.py ${filepath}
done
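The filename step in the loop (lowercase the title, then replace every run of non-alphanumeric characters with a single dash) looks like this in Python. It is a sketch with a hypothetical `slugify` helper; for plain ASCII titles it should give the same result as the jq and gsed combination, and the example title is a guess at what produced one of my filenames.

```python
import re

def slugify(title):
    """Lowercase the title and collapse every run of non-alphanumerics into one dash."""
    return re.sub(r"[^a-z0-9]+", "-", title.lower())

# Hypothetical title for illustration
print(slugify("Connecting sensors to RPi (2/3)") + ".md")
```

Note that a title ending in punctuation leaves a trailing dash. Appending `.strip("-")` would remove it, at the cost of no longer matching the filenames the shell loop produces.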

There might be a few errors like the ones below, which you need to fix manually:

Parsing 4 -- connecting-sensors-to-rpi-2-3-.md
Unknown image type, url https://tweakers.net/ext/f/FXBzeX1gMpas2FnRDNPrTkb0/full.jpg%22Landisgyr%20ultraheat%20meter%22 has type text/html; charset=UTF-8. Probably malformed url, trying to continue
Parsing 29 -- saving-2-0gj-yr-heating-by-upgrading-heat-exchanger.md
Unknown image type, url https://tweakers.net/fotoalbum/image/NLOjGR69ocOjRCP8vsXbSLkv.jpg has type text/html. Probably malformed url, trying to continue
Unknown image type, url https://tweakers.net/fotoalbum/image/qb3IOXvCQZ07yGI2JOK5oz7c.png has type text/html. Probably malformed url, trying to continue
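Some of these can probably be avoided with stricter cleaning. In the first error the URL appears to carry a quoted link title, and the content type can include a charset suffix. The sketch below shows one way to handle both; `clean_link_url` and `extension_for` are hypothetical helpers, not part of the script above, and the fotoalbum URLs that return HTML still need manual attention.

```python
import re

def clean_link_url(linkurl):
    """Drop backslash escapes and anything from the first quote or whitespace onwards."""
    url = linkurl.replace("\\", "")
    return re.split(r'[\s"]', url, maxsplit=1)[0]

def extension_for(content_type):
    """Map a Content-Type header value to a file extension, or None if unsupported."""
    # Ignore parameters such as "; charset=UTF-8" and compare case-insensitively
    mime = content_type.split(";")[0].strip().lower()
    return {"image/png": ".png", "image/jpeg": ".jpg"}.get(mime)

broken = 'https://tweakers.net/ext/f/FXBzeX1gMpas2FnRDNPrTkb0/full.jpg"Landisgyr ultraheat meter"'
print(clean_link_url(broken))
print(extension_for("text/html; charset=UTF-8"))
```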

Goodbye all, hope this helps!

#Css #Html #Hugo #Markdown