
Notes on regex sed, bash DOM html, meta tag

Using shell tools (sed, awk, grep, vim, ...) to edit multiple files. Case study: HTML tags such as <!DOCTYPE>, <html>, and <meta> (charset, ...).

https://unix.stackexchange.com/questions/26284/how-can-i-use-sed-to-replace-a-multi-line-string
Sed cheatsheet
https://gist.github.com/asenchi/2291903

HTML meta tags (recommended way?)
https://www.quackit.com/html_5/tags/html_meta_tag.cfm


Example
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml " xml:lang="ja" lang="ja">
<head>
<meta http-equiv="Content-Type" content="text/html" charset="EUC-JP" />
<meta http-equiv="Content-Script-Type" content="text/javascript" />
<meta http-equiv="Content-Style-Type" content="text/css" />
<meta name="description" content="content here" />
<title>hoge</title>
</head>
<body>
</body>
</html>

Expected:

<!DOCTYPE html>
<html lang="ja">
<head>
<meta charset="EUC-JP">
<title>hoge</title>
<meta name="description" content="content here">
</head>
<body>
</body>
</html>

1. <!DOCTYPE html ...>: my idea is to delete everything on the line after the match "<!DOCTYPE html" and append the closing ">".
$  sed -i 's/<!DOCTYPE html.*/<!DOCTYPE html>/' filename

For filename, I use a simple way to get the list of HTML files that contain DOCTYPE: find . -name "*.html" | xargs grep -li "DOCTYPE"
Then I use column editing (Vim or Sublime Text) to add the prefix sed .... to each file in the list.
I think a "one line" solution (pipes, command chaining, etc.) is possible, but working it out seemed to take more time.
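
The find, grep and column-edit steps can probably be collapsed into one pipeline. Below is a minimal sketch, assuming GNU grep, xargs and sed (-Z, -0, -r and -i are GNU options). The scratch directory and the file names a.html and no-doctype.html are made up for the demonstration.

```shell
# Work in a scratch directory so the sketch is safe to run as-is.
workdir=$(mktemp -d)
mkdir -p "$workdir/site"
printf '%s\n' \
  '<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">' \
  '<html>' > "$workdir/site/a.html"
printf '%s\n' '<html>' > "$workdir/site/no-doctype.html"

# grep -rliZ lists the matching files separated by NUL bytes;
# xargs -0 -r hands them to sed and skips sed when nothing matched.
grep -rliZ --include='*.html' 'DOCTYPE' "$workdir/site" \
  | xargs -0 -r sed -i 's/<!DOCTYPE html.*/<!DOCTYPE html>/'

first_line=$(head -n 1 "$workdir/site/a.html")
untouched=$(cat "$workdir/site/no-doctype.html")
echo "$first_line"   # prints: <!DOCTYPE html>
```

On real files it is safer to run the grep part alone first and review the list before piping it into sed -i.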

Special case:

-<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
 "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">

=> +<!DOCTYPE html>
Because the declaration spans two lines, the sed regex above has an issue here: sed processes one line at a time, so only the first line is rewritten and the second line is left behind. Fortunately there are only a few such files, so I fixed them manually. As with many other automation tasks, there are always exceptions, so be careful to double-check the input and output.
sed, awk, and regex come in handy again and again. I also mentioned them in some previous posts (should be linked here).
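
The two-line special case can probably be automated as well, using sed's N command to join the next line. This is a minimal sketch, assuming GNU sed (where "." also matches the embedded newline) and assuming the declaration never spans more than two lines. It is not what was done for these notes, where the few affected files were fixed by hand.

```shell
# Sample input: a DOCTYPE split across two lines, as in the special case.
input=$(printf '%s\n' \
  '<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"' \
  ' "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">' \
  '<html lang="ja">')

# If the DOCTYPE line has no closing ">" yet, N appends the next input line
# to the pattern space; the substitution then replaces the joined declaration.
joined=$(printf '%s\n' "$input" \
  | sed '/<!DOCTYPE html[^>]*$/{N;s/<!DOCTYPE html.*>/<!DOCTYPE html>/;}')

printf '%s\n' "$joined"
# prints:
# <!DOCTYPE html>
# <html lang="ja">
```

A DOCTYPE that already fits on one line is left alone by this command, because the address only matches lines without a closing ">".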


2. <html xmlns="http://www.w3.org/1999/xhtml " xml:lang="ja" lang="ja"> =>
<html lang="ja">

In this case, similar to the previous one, I use sed to remove the front part of the line (everything before lang=...). I could not find a good working answer on the Internet, so I worked around it:
- Append the keyword 'delete-me' (it should be unique) to the first part of the line, before the match. We don't need this part and can remove it later.
- Prepend <html to the second part (starting at the match) and keep the tail of the line.

$ sed -i 's/ lang=/delete-me\n<html&/g'  filename.html

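& here stands for the text that was matched, i.e. the keyword ' lang=', so the replacement puts it back right after <html. The keyword starts with a space, so it does not match xml:lang=.
Then run sed again to remove the temporary dummy lines: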
$ sed -i '/delete-me/d'  filename.html
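
To see what the two commands do, they can be tried on a single line without -i. The sketch below assumes GNU sed (\n in the replacement is a GNU feature). It also shows a one-step alternative, which is not the approach used in these notes.

```shell
line='<html xmlns="http://www.w3.org/1999/xhtml " xml:lang="ja" lang="ja">'

# Two-step approach from the notes: & re-inserts the matched text ' lang=',
# so the line is split just before it and the first half is tagged.
two_step=$(printf '%s\n' "$line" \
  | sed 's/ lang=/delete-me\n<html&/g' \
  | sed '/delete-me/d')

# One-step alternative: [^>]* is greedy, so it swallows every attribute
# up to the last ' lang=' inside the same tag.
one_step=$(printf '%s\n' "$line" | sed 's/<html[^>]* lang=/<html lang=/')

printf '%s\n' "$two_step" "$one_step"
# prints <html lang="ja"> twice
```

The one-step form only touches <html ...> tags, while the two-step form splits any line that contains ' lang=', so the two can differ on files where other tags carry a lang attribute.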

3. Removing the CSS and JS meta tags is straightforward
$ sed -i '/meta http-equiv="Content-Style/d' filename

Replace Content-Style with Content-Script to remove the JS meta tag.
Note the keyword 'meta http-equiv': without it, a keyword such as "Content-Style" alone could also match other tags, for example <script>, and delete those lines too.
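
Both meta tags can probably be removed in one pass. A minimal sketch, assuming a sed that supports -E for extended regular expressions (GNU sed does); the file name app.js in the sample input is made up.

```shell
input=$(printf '%s\n' \
  '<meta http-equiv="Content-Script-Type" content="text/javascript" />' \
  '<meta http-equiv="Content-Style-Type" content="text/css" />' \
  '<script type="text/javascript" src="app.js"></script>' \
  '<meta name="description" content="content here" />')

# -E lets (Style|Script) cover both tags in one address; keeping
# 'meta http-equiv=' in the pattern leaves <script> lines untouched.
cleaned=$(printf '%s\n' "$input" \
  | sed -E '/meta http-equiv="Content-(Style|Script)-Type"/d')

printf '%s\n' "$cleaned"
# prints:
# <script type="text/javascript" src="app.js"></script>
# <meta name="description" content="content here" />
```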

4. <meta http-equiv="Content-Type" content="text/html" charset="EUC-JP" />
or
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />  (most common case).

=> <meta charset="EUC-JP">

This case is similar to the <html lang=...> conversion above.

$  sed -i 's/; charset=UTF/delete-me&\n<meta charset=\"UTF/g' filename

Because the keyword "charset=" can also appear in <script> or other tags:
<script ... charset="...">, we have to include '; ' in the keyword.

The most common case is content="...; charset=...".
I think I made a small mistake here: UTF does not need to be in the keyword. It ended up there because the command I first copied from another tutorial had '&' at the very end of the replacement ('&/g'). '&' inserts the matched keyword itself into the replacement, so its position decides whether the keyword ends up before or after the added text.
So I will try to run it without UTF later. (But I think I am too lazy to try again :)

Anyway, it works.

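We need another sed run.
$ sed -i 's/; charset=\"/delete-me&\n<meta charset=/g'  file2edit.html

This time it is for all "; charset=" matches. Then run sed again to remove all the dummy lines that contain 'delete-me'.
As written, the pattern only matches when a double quote directly follows "; charset=", and the replacement does not add an opening quote after <meta charset=, so check the result on a sample file first.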


At first, I thought one sed run would be enough for this case, applied to the keyword '; charset=' or simply ' charset=' if we ignore the <script> tag, since it appears only a few times and can be edited manually.
But the real reason the second sed run is required is a case like this:
<meta http-equiv="Content-Type" content="text/html" charset="EUC-JP" />

As you can see, there is no content="...; charset=..." here. Focus on the '; ' characters: they are missing, but we still have to remove the whole front part of the line before the keyword. So the second sed run is needed.
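
Both forms of the Content-Type meta tag can probably be handled by a single substitution that captures the charset value. This is a minimal sketch, assuming GNU sed with -E and assuming the whole tag sits on one line. It is an alternative to the delete-me approach, not the commands used in these notes, and the file name app.js is made up.

```shell
input=$(printf '%s\n' \
  '<meta http-equiv="Content-Type" content="text/html" charset="EUC-JP" />' \
  '<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />' \
  '<script type="text/javascript" charset="utf-8" src="app.js"></script>')

# The pattern is anchored on the Content-Type meta tag, so <script charset=...>
# is not touched. "? allows the value to be quoted (separate attribute) or
# unquoted (inside content="..."); \1 is the captured charset name.
converted=$(printf '%s\n' "$input" | sed -E \
  's/<meta http-equiv="Content-Type"[^>]*charset="?([A-Za-z0-9_-]+)"?[^>]*>/<meta charset="\1">/')

printf '%s\n' "$converted"
# prints:
# <meta charset="EUC-JP">
# <meta charset="UTF-8">
# <script type="text/javascript" charset="utf-8" src="app.js"></script>
```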



Using sed to change multiple files
https://stackoverflow.com/questions/10445934/change-multiple-files

Diff two text files to get only the lines that differ. The idea is to sort both files first and then compare them.
https://stackoverflow.com/questions/10708300/compare-two-files-ignoring-order
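
The sort-then-compare idea can be sketched with sort and comm. The file names before.txt and after.txt and their contents are made up for the demonstration; comm expects both inputs to be sorted the same way.

```shell
workdir=$(mktemp -d)
printf '%s\n' 'b.html' 'a.html' 'c.html' > "$workdir/before.txt"
printf '%s\n' 'c.html' 'a.html' 'd.html' > "$workdir/after.txt"

# Sort both lists first so that line order no longer matters.
sort "$workdir/before.txt" > "$workdir/before.sorted"
sort "$workdir/after.txt"  > "$workdir/after.sorted"

# comm -23 keeps lines found only in the first file,
# comm -13 keeps lines found only in the second file.
only_before=$(comm -23 "$workdir/before.sorted" "$workdir/after.sorted")
only_after=$(comm -13 "$workdir/before.sorted" "$workdir/after.sorted")

echo "only in before.txt: $only_before"   # prints: only in before.txt: b.html
echo "only in after.txt: $only_after"     # prints: only in after.txt: d.html
```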
